Regular Expression

Last time, we learned some basic functions from the stringr package for handling and working with text in R. In this course, however, we want to unleash the full power of string manipulation, so we are going to learn about regular expressions.

What are Regular Expressions?

The name “Regular Expression” does not say much. However, regular expressions are all about text. Think about how much text is all around us in our modern digital world: emails, text messages, news articles, blogs, comments, tweets—all these things are text. Regular expressions are a tool that allows us to work with these text data by describing text patterns.

A regular expression is a special string for describing a certain text pattern. In other words, a regular expression is a set of symbols that describes a set of strings. Because the term “regular expression” is rather long, most people use the word regex as a shortcut term.

It is worth noting what regular expressions are NOT. They’re NOT a programming language. They may look like some sort of programming language because they are a formal language with a defined set of rules that makes a computer do what we want it to do. However, there are no variables in regex and you can’t do computations like adding 2 + 2.

What are Regular Expressions (RegEx) used for?

We use regex to work with text. You could use regex to search a document for the word “center”, whether it is spelled “center” or “centre”. You could also search a document and replace all occurrences of “Korea, South”, “Republic of Korea”, or “R.O.K.” with “South Korea”.

Consider the second and third problems we had. Our text document from NAVER News contains many punctuation marks, numbers, and alphabetic characters that we may want to remove. How can we detect and extract them from the document?

We may use regex to detect such characters. For example, if we want to detect and extract a percentage term “16%” from the document, we can do so by the following R input: str_extract_all(text_trim, "16%")

However, this pattern is very specific and does not generalize. The document contains a variety of percentage terms, and we need to remove all of them at once. When the text becomes large, it is hardly possible to list every term we want to remove. With regex, we can instead describe what we are looking for. In the case of NAVER News, we can detect and extract any term consisting of digits followed by a percent sign without having to spell out each term. Once we define a regex pattern, R will return the matching results.
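As a minimal sketch of such a general pattern, the example below extracts every percentage term from a short sample sentence. The sentence and the names sample_text and pct_pattern are made up for illustration (they are not part of the NAVER document), and the pattern is deliberately simple: one or more digits or dots, followed by a percent sign.

```r
library(stringr)

# Hypothetical sample sentence, not the NAVER article itself.
sample_text <- "CPU is 16% faster and GPU 8.3% faster; versus the A12, 40% and 30%."

# One or more digits or dots, followed by a literal percent sign.
pct_pattern <- "[0-9.]+%"

pct_terms <- str_extract_all(sample_text, pct_pattern)[[1]]
print(pct_terms)
# [1] "16%"  "8.3%" "40%"  "30%"
```

Note that the "12" in "A12" is not returned, because it is not followed by a percent sign.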

Before getting into regex

Regular expressions may seem difficult to understand at first. You will see strings with a bunch of letters, digits, and punctuation symbols combined in seemingly nonsensical ways. Like programming and data analysis, becoming fluent in defining regex patterns takes time and requires a lot of practice. But the more you practice, the more fluent you will become in defining complex patterns and getting the most out of them. Regex is also supported by most other programming languages, such as Python, Perl, and Java.

Regex Basics

Our purpose in working with regex is to describe patterns that are matched against text strings. That is to say, working with regex is all about pattern matching. The result of a match is either successful or not. So, as long as you specify the text pattern you want to detect, R will return the characters (or strings) that match the pattern.

As mentioned above, the simplest version of pattern matching is to search for any occurrences of some specific characters in a string. For example, we searched for the term “16%” in a text document from the NAVER news text.

But we may need to form a regex pattern with a more complex structure. For example, what if we want to match all words that start with a number and end with %? In such cases we construct a regex in much the same way as we build an arithmetic expression, by combining smaller pieces.

Matching Literal Characters

Let’s begin with the simplest match of all: a literal character. In a literal character match, a given character such as the letter "A" matches the letter A. It is called literal because the character matches itself. This is the most basic type of regex operation: just matching plain text written inside quotes.

Here are some basic examples to help you understand regex.

The first regex we work with is "아이폰". This pattern is formed by the letter “아”, followed by the letter “이”, and ending with the letter “폰”. But this pattern matches not only the word 아이폰 but also the beginning of 아이폰12, 아이폰4, and 아이폰5. To match only the stand-alone word, our regex pattern should begin and end with a blank: " 아이폰 "

Consider the string object: text_trim

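To get a visual representation of the part of the string object that a pattern actually matches, we can use the function str_view_all() in the package stringr (in stringr 1.5.0 and later, str_view_all() is deprecated and str_view() highlights every match):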

load("text_trim.RData")

library(stringr)
text_trim
## [1] "[이데일리 장영은 기자] ‘아이폰12’(가칭) 이번주에 드디어 베일을 벗는다. 예년보다 아이폰 신작의 공개 및 출시 일정이 한달 가량 늦어지면서 ‘애플 팬’들의 기대감은 더 높아져 있는 상태다. 아이폰 12 시리즈는 △아이폰12 미니(5.4인치) △아이폰12(6.1인치) △아이폰12 프로(6.1인치) △아이폰12 프로 맥스(6.7인치) 등 4종으로 출시될 예정이다. 역대 라인업 중 가장 많은 모델로 구성된다. 우선 가장 크게 주목을 받고 있는 부분은 애플의 첫번째 5G폰이라는 점이다. 삼성전자(005930)가 세계 최초 5G폰인 ‘갤럭시S10’을 출시한지 1년 7개월만이다. 그동안 업계에서는 ‘완벽하지 않으면 내지 않는다’는 애플의 철학 때문에 5G폰 출시가 늦어지는 것이라는 해석도 나왔다. 국내에 출시되는 아이폰12 시리즈는 4종 모두 6기가헤르츠(GHz) 대역 이하(서브6) 5G 모델로 출시될 것으로 알려졌다. 다만, 최상위 모델인 ‘아이폰12 프로 맥스’의 미국 출시 모델에는 28Ghz의 초고주파(mmWave) 대역 안테나가 탑재될 것이란 전망이 나온다. 배터리 절감 차원에서 4G와 5G를 선택할 수 있는 스마트 데이터모드가 도입된다. 애플의 최신형 칩셋 ‘A14 바이오닉’이 보여줄 강력한 성능에도 이목이 쏠린다. 애플은 자체 어플리케이션 프로세서(AP)의 성능이 동급 최강이라는 자부하고 있다. 애플 최초로 ‘5나노미터’ 공정 기술이 적용된 A14는 A13보다 중앙처리장치(CPU)와 그래픽처리장치(GPU)의 속도를 각각 16%, 8.3% 향상시킬 것으로 예상되고 있다. A12와 비교하면 CPU 속도는 40%, GPU는 30% 각각 향상된다. 디자인 측면에서는 과거로 회귀할 것이라는 전망이다. 기기의 가장자리에 ‘깻잎 통조림’이라는 별칭을 얻었던 ‘아이폰4’와 ‘아이폰5’ 처럼 평평한 금속 테두리를 적용할 것으로 알려졌다. 전작의 인기 색상이었던 ‘미드나잇 그린’을 대체할 색상으로 다크 블루 색상을 채택하고 액정을 보호하기 위한 ‘세라믹 쉴드’ 코딩도 새롭게 적용된다. 이밖에도 아이폰12에서는 충전방식이 라이트닝에서 USB-C 타입으로 바뀌고, 원가절감과 환경 보호를 위해 기본 구성품으로 제공되던 유선이어폰(이어팟)과 충전기가 빠진다. 아이폰12는 미국을 포함한 1차 출시국에 이르면 23일께 공식 출시될 것으로 예상된다. 아이폰12 프로 맥스는 부품 수급 등 문제로 11월에 따로 출시될 가능성도 있다. 특히 올해는 아이폰 신작의 국내 출시 일정이 획기적으로 앞당겨질 것이라는 소식이다. 첫 5G폰인 만큼 세계 최초로 5G 서비스를 상용화한 국내에 1차 출시국에 준하는 일정으로 이달 말께 선보인다는 것이다. 가격은 가장 저렴한 ‘아이폰12 미니’ 기준으로 649달러부터 749달러까지 다양한 관측이 나오고 있는 상황이다. 업계에서는 아이폰12이 아이폰 사용자들의 교체 수요를 자극해 예년보다 많이 판매될 것으로 예상하고 있다. 전 세계에서 사용되는 아이폰 약 9억5000만대 중 3억5000만대가 1년 안에 교체될 가능성이 높은 구형 제품인데다, 애플의 첫 5G폰이기 때문이다. 또 미국 중고 휴대폰 셀셀닷컴이 미국 안드로이드 스마트폰 사용자 2000명을 대상으로 실시한 설문조사에 따르면 응답자의 33%가 아이폰12로 교체를 희망하는 것으로 나타났다. 한편, 애플은 오는 13일(현지시간) 미국 샌프란시스코 본사에서 온라인으로 아이폰12 공개 행사를 진행할 계획이다. 행사는 애플 홈페이지를 통해 실시간으로 시청할 수 있다. 장영은 (bluerain@edaily.co.kr)"
str_view_all(text_trim, " 아이폰 ") # the string comes first, followed by the regex pattern

This may seem simple, but there are a couple of details to highlight. The first is that regex searches are case sensitive. This means that the pattern "5g" would not match “5G” in text_trim.

str_view_all(text_trim, "5g") # regex is case sensitive, so it does not match anything

The second is that regex counts a blank as a character: blanks are considered literal characters. Let’s test the pattern " 아이폰  ", which ends with two blanks:

str_view_all(text_trim, " 아이폰  ") # regex differentiates " 아이폰 " from " 아이폰  ", which ends with two blanks
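Both details can be checked on a short string with str_detect(). This is only a sketch: the headline below is a made-up example, and regex(..., ignore_case = TRUE) is shown as one way to switch off case sensitivity when that is what we want.

```r
library(stringr)

# Hypothetical example string.
headline <- "Apple unveils its first 5G phone"

# Case sensitivity: "5g" does not match "5G" unless we ask for it.
lower_hit   <- str_detect(headline, "5g")
upper_hit   <- str_detect(headline, "5G")
nocase_hit  <- str_detect(headline, regex("5g", ignore_case = TRUE))

# Blanks are literal characters: one trailing blank matches, two do not.
one_blank   <- str_detect(headline, " 5G ")
two_blanks  <- str_detect(headline, " 5G  ")

print(c(lower_hit, upper_hit, nocase_hit, one_blank, two_blanks))
# [1] FALSE  TRUE  TRUE  TRUE FALSE
```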

Metacharacters

Now, we are going to learn about metacharacters. The most basic type of regex consists of literal characters that match themselves. But not all characters match themselves. Any character that does not match itself is a metacharacter. Metacharacters have a special meaning, and they allow us to transform literal characters in very powerful ways.

Here is the list of the 15 metacharacters in regex: . \ | ( ) [ ] { } $ - ^ * + ?

Throughout this course, we are going to work with these metacharacters. Actually, what we need to know about regex is how these metacharacters work. Fortunately, there are only a few metacharacters to learn. Unfortunately, some metacharacters have more than one meaning. The meaning of a metacharacter depends on the context in which we use it, how we use it, and where we use it. So learning those meanings may take time and requires hours of practice.

The Wild Metacharacter, the dot

The first metacharacter we learn about is the dot or period ".", better known as the wild metacharacter. This metacharacter is used to match ANY character except for a new line.

For example, consider the pattern "..폰". This pattern will match not only 아이폰, but also 5G폰, 휴대폰, and so on. But it will not match the whole word 스마트폰, only its last three characters 마트폰, because each dot matches exactly one character.

str_extract_all(text_trim, "..폰")
## [[1]]
##  [1] "아이폰" "아이폰" "아이폰" "아이폰" "아이폰" "아이폰" "아이폰" "5G폰"  
##  [9] "5G폰"   "5G폰"   "아이폰" "아이폰" "아이폰" "아이폰" "아이폰" "이어폰"
## [17] "아이폰" "아이폰" "아이폰" "5G폰"   "아이폰" "아이폰" "아이폰" "아이폰"
## [25] "5G폰"   "휴대폰" "마트폰" "아이폰" "아이폰"
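The same behavior can be seen on a small vector. The sketch below uses made-up English words so that each dot is easy to count: "c.t" needs exactly one character between "c" and "t", and "..phone" needs exactly two characters before "phone".

```r
library(stringr)

# Hypothetical example words.
words <- c("cat", "cut", "c.t", "coat", "ct")

# The dot stands for exactly one character of any kind.
dot_hits <- str_detect(words, "c.t")
print(dot_hits)
# [1]  TRUE  TRUE  TRUE FALSE FALSE

# Two dots match exactly two characters, so only part of a longer word is returned.
partial <- str_extract("smartphone", "..phone")
print(partial)
# [1] "rtphone"
```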

Escaping metacharacters, the backslash (or Korean won sign)

How can we match the character dot instead of the metacharacter, then? For instance, say we have the following character vector:

dot_words <- c("e.g", "eng", "e g", "e-g")
dot_words
## [1] "e.g" "eng" "e g" "e-g"

If we try the pattern "e.g", it will match all of the elements in dot_words.

str_view_all(dot_words, "e.g")

To actually match the dot character, we need to escape the metacharacter. In most languages, the way to escape a metacharacter is to add a backslash in front of it: "\.". When we put a backslash in front of a metacharacter, the character no longer has a special meaning and it will match itself.

However, R is a bit different. Instead of a single backslash, we must type two: "e\\.g". This is because the backslash is also the escape character in R strings, so we write "\\" in the string to pass one literal backslash on to the regex engine.

str_view_all(dot_words, "e\\.g")
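To see what the double backslash really contains, we can print the pattern with writeLines() and count its characters. The sketch below also shows fixed() from stringr as an alternative that treats the whole pattern as literal text; whether to use it is a matter of taste, and the variable names are only for illustration.

```r
library(stringr)

dot_words <- c("e.g", "eng", "e g", "e-g")

# The R string "e\\.g" holds four characters: e, backslash, dot, g.
escaped <- "e\\.g"
writeLines(escaped)          # prints e\.g, which is what the regex engine sees
n_chars <- nchar(escaped)

# Only the element with a real dot is matched.
escaped_hits <- str_detect(dot_words, escaped)

# fixed() switches off regex interpretation, so no escaping is needed.
fixed_hits <- str_detect(dot_words, fixed("e.g"))

print(escaped_hits)
# [1]  TRUE FALSE FALSE FALSE
```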

Regex practice

So far, we have learned about metacharacters and how to escape them. Next, we will learn more about metacharacters, starting with the opening and closing brackets [ ], which are used to define a character set.

Character sets

A character set matches any of the various characters that are inside the set: i.e., "[abc]" will match the characters, “a”, “b”, or “c”, in the text. The square brackets [ ] indicate the character set.

Note that the order of the characters inside the character set does NOT matter; what matters is the presence of the characters inside the brackets. So, the character set "[abc]" will match any of the lower-case letters “c”, “b”, or “a” in the text, and "[cba]" will do the same thing.
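A quick way to convince ourselves is to run both sets on the same vector and compare the results. The words below are made up for this sketch.

```r
library(stringr)

# Hypothetical example words.
fruits <- c("apple", "berry", "cherry", "kiwi")

# Does each word contain at least one of the letters a, b, or c?
hits_abc <- str_detect(fruits, "[abc]")
hits_cba <- str_detect(fruits, "[cba]")

print(hits_abc)
# [1]  TRUE  TRUE  TRUE FALSE

# The order inside the brackets makes no difference.
print(identical(hits_abc, hits_cba))
# [1] TRUE
```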

Defining character sets

Consider a regex pattern that includes a character set of vowels, "f[aeiou]n", and a vector with the words “fan”, “fen”, “fin”, “fon”, and “fun”:

library(stringr)
fns <- c("fan","fen","fin","fon","fun")
str_view_all(fns, "f[aeiou]n")

The pattern "f[aeiou]n" matches all elements in fns. Now let’s apply character sets of the same vowels to another vector, fnx:

fnx <- c("fan","fin","fun","f0n","f.n","f1n","fain")
str_extract_all(fnx, "f[aeiou][aeiou]n")
## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## [1] "fain"

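Here the pattern "f[aeiou][aeiou]n" places two character sets side by side, so it requires two vowels between “f” and “n”, and only the last element “fain” is matched. With the single set "f[aeiou]n", the opposite happens: the first three elements of fnx, “fan”, “fin”, and “fun”, are matched, and “fain” is not. A character set matches exactly one character, either “a” or “i” but not “ai”.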
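The contrast between one and two character sets can be checked directly with str_detect(). This is a small sketch on the same fnx vector; the result names are only for illustration.

```r
library(stringr)

fnx <- c("fan", "fin", "fun", "f0n", "f.n", "f1n", "fain")

# One set: exactly one vowel between "f" and "n".
one_vowel <- str_detect(fnx, "f[aeiou]n")
print(one_vowel)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

# Two sets in a row: exactly two vowels between "f" and "n".
two_vowels <- str_detect(fnx, "f[aeiou][aeiou]n")
print(two_vowels)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
```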

Character ranges

The character sets above list every character we want to match. But what if we want to match any letter in the English alphabet, either upper-case or lower-case, or any digit?

Character ranges help us solve this problem: we have a convenient shortcut based on the hyphen metacharacter "-" to indicate a range of characters. A character range consists of a character set with two characters separated by a hyphen "-" sign.

So, to match any letter or number, we can define a character set formed as:

uppercase <- "[A-Z]"

lowercase <- "[a-z]"

number <- "[0-9]"

korean_letter <- "[가-힣]"

Note that the hyphen is only a metacharacter when it is inside a character set; outside the character set it is just a literal hyphen.

How, then, do we use the character range? Let’s see the following vector with triplet strings and match various occurrences of a certain type of character.

triplets <- c("bts","the","BTS","The","010","070","아이폰","휴대폰",":-)","^^;")
str_view_all(triplets, "[a-z][a-z][a-z]") # any three consecutive lower-case letters
str_view_all(triplets, "[A-Z][A-Z][A-Z]") # any three consecutive upper-case letters
str_view_all(triplets, "[A-Z][a-z][a-z]") # any upper case letter first, followed by any two lower-case letters
str_view_all(triplets, "[가-힣][가-힣][가-힣]") # any three consecutive Korean syllables
str_view_all(triplets, "[0-9][0-9][0-9]") # any three consecutive digits

Note that the elements ":-)" and "^^;" are not matched by any of the character ranges that we have seen so far.
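Several ranges can also be written inside one pair of brackets. As a sketch, the set "[A-Za-z]" below matches any one English letter regardless of case; the vector is an ASCII-only copy of triplets made for this example.

```r
library(stringr)

# ASCII-only subset of the triplets vector, used here for illustration.
ascii_triplets <- c("bts", "the", "BTS", "The", "010", "070", ":-)", "^^;")

# Two ranges in one set: any upper-case or lower-case English letter.
letter_hits <- str_detect(ascii_triplets, "[A-Za-z][A-Za-z][A-Za-z]")
print(letter_hits)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
```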

Repetition

We can control how many times a pattern matches with the repetition operators:

{n}: exactly n times
{n,}: n times or more
{n,m}: between n and m times
?: 0 or 1 time
+: 1 or more times
*: 0 or more times

str_view_all(triplets, "[a-z]{3}") # any three consecutive lower-case letters
str_view_all(triplets, "[A-Z]{2,}") # two or more consecutive upper-case letters
str_view_all(triplets, "[A-Z][a-z]+") # one upper-case letter followed by one or more lower-case letters
str_view_all(triplets, "[가-힣]+") # one or more consecutive Korean syllables
str_view_all(triplets, "[0-9]+") # any number with one or more digits
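The operators are easiest to compare on tiny inputs. The following sketch uses made-up strings: {2,3} takes at least two and at most three repetitions, ? makes the preceding character optional, and + takes one or more.

```r
library(stringr)

# {n,m}: between two and three consecutive "a"s (hypothetical inputs).
runs <- c("a", "aa", "aaa", "aaaa")
range_match <- str_extract(runs, "a{2,3}")
print(range_match)
# [1] NA    "aa"  "aaa" "aaa"

# ?: the "u" may appear zero times or once, but not twice.
optional_u <- str_detect(c("color", "colour", "colouur"), "colou?r")
print(optional_u)
# [1]  TRUE  TRUE FALSE

# +: one or more digits are taken as a single match.
digits <- str_extract("ab123cd", "[0-9]+")
print(digits)
# [1] "123"
```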

Negative character sets

When working with regex, we will frequently need to match characters that are NOT part of a certain set. For example, we may want to match any character that is not a letter of the alphabet. This type of matching can be done with a negative character set, which matches any one character that is not in the set. To define this type of set, we use the caret metacharacter "^".

The caret "^" is one of the metacharacters that have more than one meaning depending on where it appears in a regex pattern. If we use a caret in the first position inside a character set, i.e. "[^a-z]", it means negation to indicate “not any one of the following lower-case letters.” So it matches anything except lower-case letters.

So, we can match the elements ":-)" and "^^;", which contain neither letters nor numbers, by defining a negative character set such as "[^a-zA-Z0-9]". Because triplets also contains Korean words, we add the range 가-힣 to the set as well:

str_view_all(triplets, "[^a-zA-Z가-힣0-9]{3}") # three consecutive negations of letters & digits

It is important to note that the caret means negation only when it comes first inside the character set; otherwise the set is not a negative one:

str_view_all(triplets, "[a-zA-Z가-힣0-9^]{3}") # three consecutive letters/numbers/caret

In this case, the pattern "[a-zA-Z가-힣0-9^]" means “any one letter or number or caret character,” which is completely different from the negative set "[^a-zA-Z가-힣0-9]" that negates any one letter/number.

How can we match the literal character ^ in the last element of triplets without a character set? Use double backslashes!

If we want to match any character except the caret, then we need to use a character set with two carets: "[^^]". The first caret works as a negative operator, the second caret is the caret character itself:

str_view_all(triplets, "\\^\\^;")
str_view_all(triplets, "[^^][^^][^^]") # three consecutive characters, none of which is a caret
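The two roles of the caret can be told apart on a short string. In this sketch (the formula string is made up), "\\^" counts literal carets, while "[^^]" matches every character that is not a caret, so removing those matches leaves only the carets behind.

```r
library(stringr)

# Hypothetical example string containing two carets.
formula_text <- "2^3 = 8 and 3^2 = 9"

# Escaped caret outside a set: a literal caret.
n_carets <- str_count(formula_text, "\\^")
print(n_carets)
# [1] 2

# Negative set "[^^]": any character except the caret.
only_carets <- str_replace_all(formula_text, "[^^]", "")
print(only_carets)
# [1] "^^"

# Three consecutive non-caret characters: "a^c" has no such run.
no_caret_run <- str_detect(c("abc", "a^c"), "[^^][^^][^^]")
print(no_caret_run)
# [1]  TRUE FALSE
```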

Metacharacters inside character sets

Now we know what character sets are, how to define character ranges, and how to specify negative character sets. From now on, let’s talk about what happens when including metacharacters inside character sets.

Except for the caret in the first position, most other metacharacters inside a character set are already escaped. This means that we do not need to escape them with double backslashes inside the character set.

Consider the vector of words fnx for example. A regex with the character set formed by "f[.aiu]n" includes the dot character. And remember that the dot character is a metacharacter, in general, which matches any type of character. However, when the dot character is inside a character set, it loses its function as a metacharacter. So the character set only matches letters “a”, “i”, “u”, and the literal dot character “.” between “f” and “n”.

fnx
## [1] "fan"  "fin"  "fun"  "f0n"  "f.n"  "f1n"  "fain"
str_view_all(fnx, "f[.aiu]n") # "f", then one of "a", "i", "u", or a literal dot, then "n"
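Comparing the dot inside and outside a character set makes the difference visible. This is a small sketch on the same fnx vector; the result names are only for illustration.

```r
library(stringr)

fnx <- c("fan", "fin", "fun", "f0n", "f.n", "f1n", "fain")

# Outside a set, the dot is a wildcard: any one character between "f" and "n".
wild_dot <- str_detect(fnx, "f.n")
print(wild_dot)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

# Inside a set, the dot is literal: only ".", "a", "i", or "u" are accepted.
literal_dot <- str_detect(fnx, "f[.aiu]n")
print(literal_dot)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
```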

Unfortunately, not all metacharacters become literal characters when they are inside a character set. There are some exceptions: the closing bracket ] and the hyphen -, as well as the caret ^.

The closing bracket ] is used to enclose the character set. So, when we want to use a literal closing bracket inside a character set, we should escape it using double backslashes "[aiu\\]]".

As we’ve already seen, the hyphen character - is used to define a range of characters inside a character set, e.g. [a-d] and [0-5]. So, to match a literal hyphen inside a character set, we escape it in the same way: "[a\\-z]" matches “a”, “-”, or “z”.

escape <- c("f^n","f]n","f-n") # We need a regex pattern to match these character patterns
str_view_all(escape, "f[\\]\\-\\^]n") # each metacharacter is escaped by putting double backslashes in front of it
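To confirm that the escaped set accepts only the three punctuation characters, we can add a counterexample to the vector. The extra element "fxn" and the name escape_plus are assumptions made for this sketch.

```r
library(stringr)

# The original three strings plus one hypothetical counterexample.
escape_plus <- c("f^n", "f]n", "f-n", "fxn")

# Inside the set, "]", "-", and "^" are each escaped with double backslashes.
escaped_set_hits <- str_detect(escape_plus, "f[\\]\\-\\^]n")
print(escaped_set_hits)
# [1]  TRUE  TRUE  TRUE FALSE
```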