July 9, 2020

Motivation

  • The task: extract the information needed to build tables from Mplus (or SAS, etc.) output. Doing this in code keeps the numbers in the tables consistent with the output and makes the process reproducible.

  • This is a very quick introduction, rewritten from Prof. Roger Peng's lectures, to the general notion of regular expressions and how they can be used to process text. For more information, refer to the website.

What are Regular Expressions?

  • A tool for searching and matching parts of a text by describing the patterns that should be used to identify those parts

  • Regular expressions (called Regex for short) can be thought of as a combination of literals and metacharacters

  • To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar.

  • Regular expressions have a rich set of metacharacters

Usage Examples

  • Test if a credit card number has the correct number of digits

  • Test if an email address is in a valid format

  • Search a document for either "color" or "colour"

  • Replace all occurrences of "Bob", "Bobby", or "B." with "Robert"

  • Count how many times "training" is preceded by "computer", "video", or "online"

Literal Strings

The simplest pattern consists only of literals. The literal "nuclear" would match the following lines:

Ooh. I just learned that to keep myself alive after a nuclear blast! All I have to do is milk some rats then drink the milk. Aweosme. :}

Laozi says nuclear weapons are mas macho

Chaos in a country that has nuclear weapons – not good.

my nephew is trying to teach mea nuclear physics, or possibly just trying to show me how smart he is so I'll be proud of him [which I am]

lol if you ever say "nuclear" people immediately think DEATH by radiation LOL
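A minimal sketch of literal matching in R, assuming a small character vector of sample lines. The variable names are illustrative and not part of the lecture:

```r
# Sample lines for illustration; only the pattern "nuclear" comes from the slide.
tweets <- c(
  "Laozi says nuclear weapons are mas macho",
  "Chaos in a country that has nuclear weapons - not good.",
  "good morning"
)

# grepl() returns TRUE for every element that contains the literal pattern.
has_nuclear <- grepl("nuclear", tweets)

# Keep only the matching lines.
matching <- tweets[has_nuclear]
print(matching)
```

A literal pattern matches anywhere in the string, so no anchors or metacharacters are needed here.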

Literal Strings

The literal "Obama" would match the following lines:

Politics r dum. Not 2 long ago Clinton was sayin Obama was crap n now shw sez vote 4 him n unite? WTF? Screw em both + Mcain. Go Ron Paul!

Clinton conceeds to Obama but will her followers listen??

Are we sure Chelsea didn't vote for Obama?

thinking … Michelle Obama is terrific!

jetlag..no sleep…earyly mornig to starbux..Ms. Obama was moving

Regular Expressions

  • Simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested
  • What if we only want the word "Obama"? Or sentences that end in the word "Clinton", or "clinton" or "clinto"?

\(\Rightarrow\) We now have a class of words (not a single word), so we want to describe that class with a single pattern: a regular expression

Regular Expressions

We need a way to express

  • whitespace word boundaries
  • sets of literals
  • the beginning and end of a line
  • alternatives ("war" or "peace")

\(\Rightarrow\) Metacharacters to the rescue!

  • Characters with special meaning
  • Transform literal characters into powerful expressions
  • Only a few to learn
  •  . * + - {} [] ^ $ | ? () : ! =
  • Can have more than one meaning, depending on where they appear
  • Related special characters and notation:
    * wildcard (.)
    * escaping (\ )
    * spaces ( )
    * tabs (\t )
    * line returns (\r, \n, \r\n)

Metacharacters

^ (the caret) represents the start of a line

^i think

will match the lines

i think we all rule for participating

i think i have been outed

i think this will be quite fun actually

i think i need to go to work

i think i first saw zombo in 1999.

Metacharacters

$ represents the end of a line

morning$

will match the lines

well they had something this morning

then had to catch a tram home in the morning

dog obedience school in the morning

and yes happy birthday i forgot to say it earlier this morning

I walked in the rain this morning

good morning
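The two anchors can be tried out with grepl(). This is a small sketch on made-up lines, with illustrative variable names:

```r
# Sample lines for illustration.
txt <- c(
  "i think i need to go to work",
  "well i think so",
  "good morning",
  "morning has broken"
)

# "^" anchors the pattern to the start of the string.
starts_with_i_think <- grepl("^i think", txt)

# "$" anchors the pattern to the end of the string.
ends_with_morning <- grepl("morning$", txt)

print(data.frame(txt, starts_with_i_think, ends_with_morning))
```

Only the first line starts with "i think", and only the third line ends with "morning".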

Metacharacters

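Non-whitespace and whitespace

\\S+ and \\s+

match runs of non-whitespace and whitespace characters, respectively (the backslash is doubled because the pattern is written inside an R string):

[A-Za-z]+ \(\rightarrow\) one or more letters (use \\S+ instead to match one or more non-whitespace characters)

[[:space:]]+ \(\rightarrow\) one or more whitespace characters (\\s+ is equivalent)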
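As a rough illustration, the whitespace classes are handy for splitting and tidying text. The input string below is made up, and the variable names are only illustrative:

```r
# A string with irregular spacing: three spaces and a tab.
x <- "well   they had\tsomething this morning"

# Split on runs of whitespace.
words <- strsplit(x, "\\s+")[[1]]

# Equivalently, extract every run of non-whitespace characters.
tokens <- regmatches(x, gregexpr("\\S+", x))[[1]]

# Collapse each run of whitespace into a single space.
squeezed <- gsub("[[:space:]]+", " ", x)

print(words)
print(squeezed)
```

Both approaches should yield the same six words; which one reads better depends on whether you think of the task as "split on separators" or "extract tokens".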

Character Classes with [ ]

We can list a set of characters we will accept at a given point in the match

[Bb][Uu][Ss][Hh]

will match the lines

The democrats are playing, "Name the worst thing about Bush!"

I smelled the desert creosote bush, brownies, BBQ chicken

BBQ and bushwalking at Molonglo Gorge

Bush TOLD you that North Korea is part of the Axis of Evil

I'm listening to Bush - Hurricance (Album Version)

Character Classes with [ ]

^[Ii] am

will match

i am so angry at my boyfriend i can't even bear to look at him

i am boycotting the apply store

I am twittering from iPhone

I am a very vengeful person when you ruin my sweetheart.

I am so over this. I need food. Mmmm bacon…

Character Classes with [ ]

Similarly, you can specify a range of letters, [a-z] or [a-zA-Z]; notice that the order of the ranges inside the brackets doesn't matter

^[0-9][a-zA-Z]

will match the lines

7th inning stretch

2nd half soon to begin. OSU did just win something

3am - cant sleep - too hot still.. :(

5ft 7 sent from heaven

1st sign of starvagtion
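The character-class patterns from these slides can be checked in R as follows. The sample lines and variable names are assumptions made for illustration:

```r
# Sample lines for illustration.
txt <- c(
  "Bush TOLD you",
  "creosote bush, brownies",
  "BUSHwalking",
  "7th inning stretch",
  "3am - cant sleep",
  "42 is a number"
)

# Each bracket accepts either case of one letter.
any_case_bush <- grepl("[Bb][Uu][Ss][Hh]", txt)

# A digit at the start of the line, immediately followed by a letter.
digit_then_letter <- grepl("^[0-9][a-zA-Z]", txt)

# For whole-word case-insensitivity, ignore.case = TRUE is a shorter alternative.
same_result <- grepl("bush", txt, ignore.case = TRUE)

print(any_case_bush)
print(digit_then_letter)
```

Note that "42 is a number" does not match the second pattern, because the character after the first digit is another digit rather than a letter.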

Character Classes with [ ]

When used at the beginning of a character class, the "^" is also a metacharacter and indicates matching characters NOT in the indicated class

[^?.]$

will match the lines

i like basketballs

6 and 9

dont worry… we all die anyway!

Not in Baghdad

helicopter under water? hmmm
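A short sketch of the negated class, using made-up lines. Inside brackets the period is a literal character, so it needs no escaping here:

```r
# Sample lines for illustration.
txt <- c(
  "i like basketballs",
  "helicopter under water? hmmm",
  "are you sure?",
  "Not in Baghdad."
)

# TRUE when the last character is anything other than "?" or ".".
no_final_punct <- grepl("[^?.]$", txt)
print(no_final_punct)
```

The second line matches even though it contains a question mark, because only the final character is tested.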

More Metacharacters

"." is used to refer to any character. So

9.11

will match the lines

its stupid the post 9-11 rules

if any 1 of us did 9/11 we would have been caught in days.

NetBios: scanning ip 203.169.114.66

Front Door 9:11:46 AM

Sings: 0118999881999119725…3 !

(In the last line, "." matches the 9 in "9911". The "." always matches exactly one character; it cannot match nothing.)
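To see that "." stands for exactly one character, here is a small check in R. The test strings are made up for illustration:

```r
# Sample strings for illustration.
txt <- c("post 9-11 rules", "did 9/11", "Front Door 9:11:46 AM", "911", "9911")

# "." matches any single character, so "911" (nothing between 9 and 11) fails.
dot_match <- grepl("9.11", txt)

# Escaping the dot matches a literal period only.
literal_dot <- grepl("9\\.11", c("9.11", "9-11"))

print(dot_match)
print(literal_dot)
```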

More Metacharacters: |

The | character does not mean "pipe" in the context of regular expressions; instead it translates to "or". We can use it to combine two expressions, the subexpressions being called alternatives

flood|fire

will match the lines

is firewire like usb on none macs?

the global flood makes senese within the context of the bible

year ive had the fire on tonight

… and the floods, hurricances, killer heatwaves, rednecks, gun nuts, etc.

More Metacharacters: |

We can include any number of alternatives…

flood|earthquake|hurricane|coldfire

will match the lines

Not a whole lot of hurricances in the Arctic.

We do have earthquakes nearly every day somewhere in our State

hurricanes swirl in the orther direction

coldfire is STRAIGHT!

'cause we keep getting earthquakes
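One way to see which alternative matched is to extract the match with regexpr() and regmatches(). This is a sketch on made-up lines, with illustrative names:

```r
# Sample lines for illustration.
txt <- c(
  "is firewire like usb",
  "the global flood",
  "earthquakes nearly every day",
  "nothing here"
)

pattern <- "flood|earthquake|hurricane|fire"

# Which lines contain at least one alternative?
has_disaster <- grepl(pattern, txt)

# Extract the first alternative found in each matching line.
# regmatches() drops the lines that did not match.
found <- regmatches(txt, regexpr(pattern, txt))

print(has_disaster)
print(found)
```

The alternatives match as substrings, which is why "firewire" and "earthquakes" both count.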

More Metacharacters: |

The alternatives can be real expressions and not just literals

^[Gg]ood|[Bb]ad

will match the lines

good to hear some good knews from someone here

Good afternoon fellow american infidels!

good on you-what do you drive?

Katie… guess they had bad experiences…

my middle name is trouble, Miss Bad News

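More Metacharacters: ( and )

Subexpressions are often contained in parentheses to constrain the alternatives. In ^[Gg]ood|[Bb]ad the anchor applies only to the first alternative, so "bad" can match anywhere in the line. With parentheses, the anchor applies to both alternatives:

^([Gg]ood|[Bb]ad)

will match the lines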

bad habbit

bad coordination today

good, becuase there is nothing worse than a man in kinky underwear

Badcop, its because people want to use drugs

Good Monday Holiday

Good riddance to Limey
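The effect of the parentheses can be compared directly in R. The sample lines below are made up, and the variable names are illustrative:

```r
# Sample lines for illustration.
txt <- c(
  "good to hear",
  "Bad news first",
  "guess they had bad experiences",
  "not so good"
)

# Without parentheses: "good" at the start of the line, OR "bad" anywhere.
ungrouped <- grepl("^[Gg]ood|[Bb]ad", txt)

# With parentheses: "good" or "bad", but only at the start of the line.
grouped <- grepl("^([Gg]ood|[Bb]ad)", txt)

print(data.frame(txt, ungrouped, grouped))
```

The third line is the one that differs: it contains "bad" in the middle, so it matches only the ungrouped pattern.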

More Metacharacters: ?

The question mark indicates that the indicated expression is optional

[Gg]eorge( [Ww]\.)? [Bb]ush

will match the lines

i bet i can spell better than you and george bush combined

BBC reported that President George W. Bush claimed God told him invade

a bird in the hand is worth two george bushes

(The backslash \ is sometimes called the escape character: in this pattern it makes "." a literal period rather than a metacharacter, so that it is not misinterpreted.)

One thing to note…

In the following

[Gg]eorge( [Ww]\.)? [Bb]ush

we wanted to match "." as a literal period; to do that, we had to "escape" the metacharacter by preceding it with a backslash. In general, we have to do this for any metacharacter we want to match literally
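In R the backslash itself has to be doubled inside a string literal, so the escaped period is written "\\.". A brief sketch with made-up test lines:

```r
# The optional group " W." uses an escaped (literal) period.
pattern <- "[Gg]eorge( [Ww]\\.)? [Bb]ush"

# Sample lines for illustration.
txt <- c(
  "george bush combined",
  "President George W. Bush claimed",
  "George Walker Bush",
  "George Wx Bush"
)

matched <- grepl(pattern, txt)
print(matched)
```

The last line fails because the escaped period does not accept "x"; with an unescaped "." it would have matched.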

More metacharacters: * and +

The * and + signs are metacharacters used to indicate repetition; * means "any number, including none, of the item" and + means "at least one of the item"

\(.*\)

will match the lines

anyone wanna chat? (24, m, germany)

hello, 20.m here… ( east area + drives + webcam )

(he means older men)

()

More metacharacters: * and +

The * and + signs are metacharacters used to indicate repetition; * means "any number, including none, of the item" and + means "at least one of the item"

[0-9]+ (.*)[0-9]+

will match the lines

working as MP here 720 MP battallion, 42nd birgade

so say 2 or 3 years at colleage and 4 at uni makes us 23 when

it went down on several occasions for like, 3 or 4 *day*

Mmmm its time 4 me 2 go 2 bed
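A small sketch of the two repetition operators in R. The parentheses are escaped in the first pattern so that they match literal "(" and ")" characters; the sample lines are made up:

```r
# Sample lines for illustration.
txt <- c(
  "anyone wanna chat? (24, m, germany)",
  "()",
  "no parentheses here",
  "time 4 me 2 go 2 bed",
  "only 1 number"
)

# Literal parentheses with anything, including nothing, between them.
parens_star <- grepl("\\(.*\\)", txt)

# With "+" at least one character is required between the parentheses.
parens_plus <- grepl("\\(.+\\)", txt)

# A number, a space, anything, then another number.
two_numbers <- grepl("[0-9]+ (.*)[0-9]+", txt)

print(data.frame(txt, parens_star, parens_plus, two_numbers))
```

The line "()" separates the two operators: it matches with * but not with +.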

More metacharacters: { and }

{ and } are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression

[Bb]ush( +[^ ]+){1,5} +debate

(the parenthesized group, one or more spaces followed by a word, may be repeated 1 to 5 times)

will match the lines

Bush has historically won all major debates he's done.

in my view, Bush doesn't need these debates..

bush doesn't need to debates? maybe you are right

That's what Bush supporters are doing about the debate.

Felix, I don't disagree that Bush was poorly prepared for the debate.

indeed, but still, Bush should have taken the debate more seriously.

Keep repeating that Bush smirked and scowled during the debate

More metacharacters: { and }

  • {m,n} means at least m but not more than n matches
  • {m} means exactly m matches
  • {m,} means at least m matches
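The three interval forms can be sketched in R as below. The word-gap pattern is written as [Bb]ush( +[^ ]+){1,5} +debate, where each repetition is "spaces followed by a word" and the spaces before "debate" are matched separately; the test strings are made up:

```r
# {m}: exactly four digits.
exactly_four <- grepl("^[0-9]{4}$", c("2020", "20200"))

# {m,}: two or more consecutive "o".
two_or_more <- grepl("o{2,}", c("so", "soo", "sooo"))

# {m,n}: between 1 and 5 words separating "Bush" and "debate".
pattern <- "[Bb]ush( +[^ ]+){1,5} +debate"
txt <- c(
  "Bush was poorly prepared for the debate.",
  "Bush should have taken the debate more seriously.",
  "Bush debate",
  "Bush one two three four five six debate"
)
within_five_words <- grepl(pattern, txt)

print(exactly_four)
print(two_or_more)
print(within_five_words)
```

The last two lines fail for opposite reasons: one has no word between "Bush" and "debate", the other has six.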

More metacharacters: ( and ) revisited

  • In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a "|", but also can be used to "remember" text matched by the subexpression enclosed
  • We refer to the matched text with \1, \2, etc.

So the expression

 +([a-zA-Z]+) +\1 +

will match the lines

time for bed, night night twitter!

blah blah blah blah

my tattoo is so so itchy today

i was standing all all alone against the world outside…

hi anybody anybody at home

estudiando css css css css…. que desastritooooo
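A minimal sketch of backreferences in R, where \1 is written "\\1" inside a string. The sample lines are illustrative, and the gsub() call is an added example of reusing the remembered text in a replacement:

```r
# A word, then the same word again, each surrounded by spaces.
pattern <- " +([a-zA-Z]+) +\\1 +"

# Sample lines for illustration.
txt <- c(
  "time for bed, night night twitter!",
  "my tattoo is so so itchy today",
  "no repeats in this line"
)

has_repeat <- grepl(pattern, txt)

# The same backreference works in the replacement:
# collapse the doubled word into a single copy.
deduped <- gsub(pattern, " \\1 ", txt[2])

print(has_repeat)
print(deduped)
```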

More metacharacters: *

The * is "greedy" so it always matches the longest possible string that satisfies the regular expression. So

^s(.*)s

matches

sitting at starbucks

setting up mysql and rails

studying stuff for the exams

spaghetti with marshmallows

stop fighting with crackers

sore shoulders, stupid ergonomics

The greediness of * can be turned off with the ?, as in

^s(.*?)s

which tells R to match lazily: the match stops at the first "s" that completes the pattern, giving the shortest possible match instead of the longest.
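The difference shows up when the matched text is extracted. The sketch below passes perl = TRUE, which is an assumption made here to use the Perl-compatible engine for the lazy quantifier; the lecture does not specify an engine:

```r
x <- "sitting at starbucks"

# Greedy: runs from the first "s" to the LAST "s" in the string.
greedy <- regmatches(x, regexpr("^s(.*)s", x, perl = TRUE))

# Lazy: stops at the first "s" that completes the match.
lazy <- regmatches(x, regexpr("^s(.*?)s", x, perl = TRUE))

print(greedy)
print(lazy)
```

Both patterns match the same lines with grepl(); the laziness only changes how much text the match covers.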

Regular Expression Functions

The primary R functions for dealing with regular expressions are

  • grep() (g/re/p: Global regular expression print), grepl(): search for matches of a regular expression/pattern in a character vector; grep() returns the indices of the elements that contain a match or, with value = TRUE, the matching strings themselves. grepl() returns a TRUE/FALSE vector indicating which elements match

  • regexpr(), gregexpr(): search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match

  • sub(), gsub(): search a character vector for regular expression matches and replace that match with another string

  • regexec(): searches a character vector for a regular expression, much like regexpr(), but it will additionally return the locations of any parenthesized sub-expressions

  • regmatches(): a handy function that extracts the matched text from the strings without having to use substr().
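The functions listed above can be compared side by side on one small vector. This is only a sketch; the data and the "name = value" line are invented for illustration:

```r
# Sample data for illustration.
x <- c("apple 12", "banana", "cherry 7 and 42")

idx   <- grep("[0-9]+", x)                # indices of matching elements
vals  <- grep("[0-9]+", x, value = TRUE)  # the matching elements themselves
flags <- grepl("[0-9]+", x)               # one TRUE/FALSE per element

# regexpr(): start position and length of the first match in each element
# (-1 where there is no match).
m <- regexpr("[0-9]+", x)
starts  <- as.integer(m)
lengths <- attr(m, "match.length")

# regmatches(): pull out the matched text without calling substr().
first_numbers <- regmatches(x, m)

# sub() replaces the first match in each element, gsub() replaces all of them.
sub_once <- sub("[0-9]+", "N", x)
sub_all  <- gsub("[0-9]+", "N", x)

# regexec(): also reports the parenthesized sub-expressions.
kv <- "alpha = 0.05"
parts <- regmatches(kv, regexec("([a-z]+) *= *([0-9.]+)", kv))[[1]]

print(first_numbers)
print(sub_all)
print(parts)
```

In `parts`, the first element is the whole match and the remaining elements are the captured groups, which is convenient when pulling a name and a value out of a line of program output.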

Some limitations of grep

  • The grep function tells you which strings in a character vector match a certain pattern but it doesn't tell you exactly where the match occurs or what the match is (for more complicated regex)

  • The regexpr function gives you the index into each string where the match begins and the length of the match for that string.

  • regexpr only gives you the first match of the string (reading left to right). gregexpr will give you all the matches in a given string.
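The difference between regexpr() and gregexpr() can be seen on a single string. The input below is made up for illustration:

```r
x <- "it went down for like, 3 or 4 days, maybe 10"

# regexpr(): only the first match, reading left to right.
first_match <- regmatches(x, regexpr("[0-9]+", x))

# gregexpr(): every match; the result is a list with one entry per input string.
all_matches <- regmatches(x, gregexpr("[0-9]+", x))[[1]]

print(first_match)
print(all_matches)
```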

Regular Expression Engines

Application

Summary

  • Regular expressions are used in many different languages; not unique to R.
  • Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words
  • Text processing via regular expressions is a very powerful way to extract data from "unfriendly" sources (not all data comes as a CSV file)
  • The primary R functions for dealing with regular expressions are: grep, grepl, regexpr, gregexpr, sub, gsub, and regexec.

References