I did my bachelor’s degree in economics, during which I received a little bit of programming training using Stata for econometrics, but this was just enough for us to be able to fit simple models; the complex procedures were done using the wizard menus. Because of this, when I started to do data science using a programming language (I prefer R), it felt a little daunting. That is why I write all my posts as if I were explaining to my past self the things that I should have known by then.
Regular expressions (the abbreviation RegExp is used almost as a synonym) are one of those things that, when I saw them for the first time, I didn’t care about learning deeply because I believed their usage was esoteric. It was not until recently, for a public project I was involved in (GitHub page here), that I was pushed to use regular expressions to a good level. This led me to realize that RegExp is a good thing to learn early in your data science career.
RegExp in lay terms is a matching language for strings. This is not technically true, but for starters that’s good enough. The good thing about it is that it works with only minor differences across languages (R, Python, Java, etc.), which makes it even more important to learn since you can use it across different applications. Apart from matching strings, it also works for substitution based on this matching and for a lot of other applications (a great post with 100 examples of RegExp). Let’s explain RegExp with some examples:
This is the starting point: checking whether a string of characters appears within another string. It is limited in the sense that you are not specifying the position of the match within the string. It is going to match the pattern everywhere, even in places where you might not want it to, like inside a word. Now as an example let’s match “is” on several strings. I’m going to be using R but the RegExp part applies to any language.
x <- c("is", "Is", "it is red", "I don't care", "it is. is it?", "misarable")
#the grep and grepl functions are used for matching in R; I prefer grepl
grepl("is", x) #t,f,t,f,t,t
## [1] TRUE FALSE TRUE FALSE TRUE TRUE
grepl("is", x, ignore.case = TRUE)
## [1] TRUE TRUE TRUE FALSE TRUE TRUE
Two important things to notice here. First, the “Is” (with capital I) does not match because regular expressions are case sensitive. You can add an argument so that the matching function is not case sensitive, as done in the second call (if you don’t care about the case, another approach is to just lowercase every string with tolower()). Secondly, “misarable” is a match since that word contains “is” within it. This was an issue for me at first since I was expecting to match the entire word, and most of the time that’s what we want.
Also notice that the function returns a logical (Boolean) vector, so that you can easily subset the table or vector that you want.
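As a small sketch of that subsetting idea, the code below reuses the vector x from the example. The data frame df and its column names are hypothetical, made up only for illustration.

```r
x <- c("is", "Is", "it is red", "I don't care", "it is. is it?", "misarable")

# grepl() returns one TRUE/FALSE per element, so it can index the vector directly
matches <- x[grepl("is", x)]

# grep() with value = TRUE returns the matching elements themselves
same_matches <- grep("is", x, value = TRUE)

# The same logical vector can filter the rows of a data frame
# (df, id and text are hypothetical names used only for this sketch)
df <- data.frame(id = seq_along(x), text = x, stringsAsFactors = FALSE)
df_matches <- df[grepl("is", df$text), ]

print(matches)
## [1] "is"            "it is red"     "it is. is it?" "misarable"
```

Both approaches give the same elements; grepl() tends to be handier when the logical vector is reused, for example to filter a table.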
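Now we are going to find only the “is” at the beginning of a word. To do this we need to introduce the term “anchors”. These are special characters that match a position in the string instead of a character. In this case we are going to use “\\b”, the word boundary anchor. It matches the position between a word character (a letter, digit or underscore) and anything else, such as a space, a punctuation mark, or the start or end of the string. Placed before the pattern, it means that “is” must start a word; placed after the pattern, it means that “is” must end one.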
grepl("is", x)
## [1] TRUE FALSE TRUE FALSE TRUE TRUE
grepl("\\bis", x)
## [1] TRUE FALSE TRUE FALSE TRUE FALSE
Notice that now the string “misarable” does not match, since its “is” is in the middle of the word.
A thing that was difficult to grasp at first about RegExp was that these anchors must be inside the pattern that you would like to match, within the quotes; in our example the pattern to match is “\\bis”. The RegExp engine reads “\\b” as an anchor, not as literal text (R needs the backslash doubled inside a string, so the three characters typed as “\\b” reach the engine as a single backslash followed by “b”). If you would like to match text that is literally a backslash followed by “bis” (or any other anchor), you need to put two more backslashes (“\\”) before the anchor you want to escape. For example:
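The pattern “\\bis” only constrains the left side of the match, so a word such as “isle” would still match. A minimal sketch of matching “is” as a whole word, assuming we add a second “\\b” after the pattern, is shown below; the extra strings “isle” and “this” are hypothetical additions to the example vector.

```r
x <- c("is", "Is", "it is red", "I don't care", "it is. is it?", "misarable")

# Two extra strings (hypothetical, added for this sketch) to show the difference
words <- c(x, "isle", "this")

# "is" at the start of a word: "isle" matches, "this" does not
starts_word <- grepl("\\bis", words)
print(starts_word)
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

# "is" as a complete word: a boundary is required on both sides
whole_word <- grepl("\\bis\\b", words)
print(whole_word)
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
```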
y <- c("\\bis", "example \\bis", "bis")
grepl("\\\\bis", y)
## [1] TRUE TRUE FALSE
Remember that the two backslashes work for escaping any special character, not only anchors.
Now we are going to match only when the string begins with “is”. The anchor for matching the beginning of a string is “^”, therefore for matching “is” at the beginning the pattern is “^is”.
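A common case of escaping is the dot, which in RegExp means “any character”. The sketch below uses hypothetical file names and prices (not from the original example) to show the difference between the unescaped and the escaped version, plus the fixed = TRUE argument of grepl(), which treats the whole pattern as plain text.

```r
# Hypothetical file names, used only for illustration
files <- c("data.csv", "datacsv", "data.csv.bak")

# Unescaped: "." matches any character, so "datacsv" also matches
any_char <- grepl(".csv", files)
print(any_char)
## [1] TRUE TRUE TRUE

# Escaped: "\\." matches only a literal dot
literal_dot <- grepl("\\.csv", files)
print(literal_dot)
## [1]  TRUE FALSE  TRUE

# fixed = TRUE turns off RegExp and matches the pattern as plain text
plain_text <- grepl(".csv", files, fixed = TRUE)

# The same escaping works for other special characters, such as "$"
prices <- c("$5", "5 dollars")  # hypothetical values
has_dollar <- grepl("\\$", prices)
```

When the pattern contains no RegExp at all, fixed = TRUE is often the simpler choice, since nothing needs to be escaped.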
is <- c("is as a word", "is", "isle", "this is not a match")
grepl("^is", is)
## [1] TRUE TRUE TRUE FALSE
Notice that the “^is” pattern returns TRUE for the “isle” string. Now the anchor for the end of the string is “$”, and it should be put at the end of the pattern, like “is$”:
is <- c("is as a word", "is", "it is", "this is not a match")
grepl("is$", is)
## [1] FALSE TRUE TRUE FALSE
Notice that the string “is” is a match for both patterns, since it is at the beginning and at the end of the string.
This is just an introduction to RegExp; my objective has been to let you know that this tool is available in any language and to give an explanation from a non-programmer’s view. There are a lot of tutorials on the internet, maybe better than this one, so feel free to search for other resources.
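Both anchors can also be combined in one pattern. A minimal sketch, assuming we want only strings that are exactly “is” and nothing else (the vector is_strings is a hypothetical mix of the two examples above):

```r
# Hypothetical vector combining strings from the two previous examples
is_strings <- c("is as a word", "is", "isle", "it is", "this is not a match")

# "^" and "$" together: the whole string must be exactly "is"
exact <- grepl("^is$", is_strings)
print(exact)
## [1] FALSE  TRUE FALSE FALSE FALSE
```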