Regular Expressions

Harold Nelson

4/12/2018

library(reticulate)

Regular Expressions

Regular expressions define a language of patterns for working with strings. To use regular expressions, you must import the module re.

A quoted string is a regular expression of the simplest type. We can ask whether a string contains another particular string somewhere within it. The function in re that does this is search(). The first argument of search() is the pattern we are searching for. The second argument is the string within which we are searching. search() returns a match object if the pattern is found and None if it is not, so the result can be used with if as though it were a boolean value True or False.

Example

import re
st = "Tom and Jerry"
if re.search("om",st):
    print("Yes")
else:
    print("No")
    
## Yes

Exercise

Use re.search() to search for the following patterns in st: "re", "m " (one trailing space), and "m  " (two trailing spaces). Reuse the code above.

Answer

if re.search("re",st):
    print("Yes")
else:
    print("No")
    
## No
if re.search("m ",st):
    print("Yes")
else:
    print("No")  
    
## Yes
if re.search("m  ",st):
    print("Yes")
else:
    print("No")    
## No
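
The three answers repeat the same if/else block, so the test can be wrapped in a small function. The sketch below is one possible way to do that; the name contains() is made up for this illustration and is not part of the re module.

```python
import re

st = "Tom and Jerry"

def contains(pattern, text):
    """Return "Yes" if pattern is found anywhere in text, else "No".

    Hypothetical helper for this tutorial, not a function from re.
    """
    # re.search returns a match object (truthy) or None (falsy).
    return "Yes" if re.search(pattern, text) else "No"

# The three patterns from the exercise: "re", "m " and "m  " (two spaces).
answers = [contains(p, st) for p in ["re", "m ", "m  "]]
print(answers)
```

The printed list should agree with the three outputs shown above.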

Exercise

Write a little snippet of code to see what kind of object re.search() returns. It appears to be boolean. Is it? What does it return if the search succeeds? What if it fails?

Answer

print("Success")
## Success
print(" ")
res = re.search("m ",st)
print(res)
## <_sre.SRE_Match object; span=(2, 4), match='m '>
print(type(res))
## <class '_sre.SRE_Match'>
print(" ")
print("Failure")
## Failure
res = re.search("x ",st)
print(res)
## None
print(type(res))
## <class 'NoneType'>
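
The printed match object shows a span and the matched text. As a minimal sketch, these can also be read directly with the group() and span() methods of the match object. Newer Python versions print the object as re.Match rather than _sre.SRE_Match, but the methods are the same.

```python
import re

st = "Tom and Jerry"

res = re.search("m ", st)
if res is not None:
    matched_text = res.group()   # the text that matched
    start, end = res.span()      # start and end positions within st
    print(repr(matched_text), start, end)
    print(repr(st[start:end]))   # slicing st with the span gives the same text

# A failed search returns None, which has no group() method,
# so test the result before using it.
missing = re.search("x ", st)
print(missing is None)
```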

Patterns Beyond Strings

So far, we have not expanded on what we could have done with simple string methods.
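
For comparison, here is a short sketch of the same kind of test written with plain string tools: the in operator and the find() method. No import is needed.

```python
st = "Tom and Jerry"

# The "in" operator answers the same yes/no question as re.search("om", st).
has_om = "om" in st

# str.find returns the index of the first occurrence, or -1 if there is none.
index_of_m_space = st.find("m ")
index_of_x_space = st.find("x ")

print(has_om, index_of_m_space, index_of_x_space)
```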

Regular expressions allow us to be both very demanding and very flexible.

A simple example of being more demanding is requiring that the pattern be found at the very beginning or the very end of the string.

Beginning and End

The special symbol “^” in a regular expression indicates the beginning of the string. The special symbol “$” indicates the end of the string.

Examples

if re.search("er",st):
    print("Yes")
else:
    print("No")
    
 
## Yes
if re.search("^er",st):
    print("Yes")
else:
    print("No")  
    
  
## No
if re.search("er$",st):
    print("Yes")
else:
    print("No") 
## No
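
The anchored examples above both fail. The sketch below adds anchored patterns that succeed and contrasts them with patterns that are present in st but in the wrong place. The dictionary name anchored is chosen only for this illustration.

```python
import re

st = "Tom and Jerry"

anchored = {
    "^Tom": bool(re.search("^Tom", st)),      # st starts with "Tom"
    "ry$": bool(re.search("ry$", st)),        # st ends with "ry"
    "^Jerry": bool(re.search("^Jerry", st)),  # "Jerry" is present, but not at the start
    "Tom$": bool(re.search("Tom$", st)),      # "Tom" is present, but not at the end
}

for pattern, found in anchored.items():
    print(pattern, "Yes" if found else "No")
```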

Doodle

Make up an example of your own to demonstrate success and failure with the requirement to be at the beginning or end of the string.

The “Any” Character

You remember the joke about the novice who couldn’t find the any key on his keyboard. In regular expressions, there is an “any” character, the simple period or dot, “.”.

Each dot matches exactly one character, whatever that character is (by default, anything except a newline). The number of characters in the string must therefore match the count of dots exactly.

Examples

if re.search("J...y$",st):
    print("Yes")
else:
    print("No") 
## Yes
if re.search("J..y$",st):
    print("Yes")
else:
    print("No") 
## No
if re.search("J....y$",st):
    print("Yes")
else:
    print("No") 
## No
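
The dot is not limited to letters. The following sketch uses a few made-up sample strings to show that digits, spaces, and punctuation all count as “any” character, while a newline does not under the default settings.

```python
import re

# Made-up samples: each has exactly three characters between J and y.
samples = ["Jerry", "J123y", "J   y", "J-+-y"]
dot_matches = [bool(re.search("^J...y$", s)) for s in samples]
print(dot_matches)

# By default the dot does not match a newline character,
# so this search fails even though there are three characters between J and y.
newline_match = re.search("^J...y$", "Je\nry")
print(newline_match)
```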

Doodle

Make up your own examples to demonstrate matching an exact number of any characters. Show both success and failure.

One/Zero or More

To allow for one or more of any characters, place a “+” to the right of the dot.

To allow for zero or more of any characters, place a “*” to the right of the dot.

Examples

if re.search("J.+y$",st):
    print("Yes")
else:
    print("No") 
## Yes
if re.search("J.*y$",st):
    print("Yes")
else:
    print("No") 
## Yes
if re.search(".+T$",st):
    print("Yes")
else:
    print("No") 
## No
if re.search(".*T",st):
    print("Yes")
else:
    print("No") 
## Yes
if re.search("y.*",st):
    print("Yes")
else:
    print("No") 
## Yes
if re.search("y.+",st):
    print("Yes")
else:
    print("No") 
## No
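
To see why these searches succeed or fail, it can help to look at the text that was actually matched. This sketch reuses the patterns above and reads the matched text with group().

```python
import re

st = "Tom and Jerry"

matched = {
    "J.+y$": re.search("J.+y$", st).group(),  # "err" fills the one-or-more slot
    ".*T": re.search(".*T", st).group(),      # zero characters come before the T
    "y.*": re.search("y.*", st).group(),      # zero characters come after the y
}
print(matched)

# "y.+" needs at least one character after the y, and there is none.
print(re.search("y.+", st))
```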

Doodle

Make up your own examples showing both success and failure with * and +.

Characters in a Subset

We can specify that at some position there must be a character from a specific subset of characters. We write the subset, usually abbreviated as a range such as a-z or 0-9, enclosed in square brackets. These may be modified with + or *.

We can combine these to indicate that members of more than one subset would be acceptable, for example [a-z0-9] or [a-zA-Z].

Examples

if re.search("J[a-z]*$",st):
    print("Yes")
else:
    print("No") 
## Yes
if re.search("J[0-9]*$",st):
    print("Yes")
else:
    print("No") 
## No
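
The examples above use a single subset. As a sketch of the combined forms [a-z0-9] and [a-zA-Z], the code below tests a few made-up strings against three patterns. The list names are chosen only for this illustration.

```python
import re

# Made-up test strings, each starting with J.
codes = ["Jerry", "J42", "Jerry42", "JERRY"]

# Only lower case letters may follow the J.
lower_only = [bool(re.search("^J[a-z]*$", c)) for c in codes]
# Lower case letters or digits may follow the J.
lower_or_digit = [bool(re.search("^J[a-z0-9]*$", c)) for c in codes]
# Letters of either case may follow the J.
any_letter = [bool(re.search("^J[a-zA-Z]*$", c)) for c in codes]

print(lower_only)
print(lower_or_digit)
print(any_letter)
```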

Doodle

Make up your own examples showing both success and failure with the use of these subsets.

Extracting

We can use re.findall() to get a list of the matched strings found.

Examples

l1 = re.findall('[A-Z][a-z]+',st)
print(l1)
## ['Tom', 'Jerry']
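
A few more calls on the same string, as a sketch: re.findall() returns every non-overlapping match from left to right, and an empty list when nothing matches.

```python
import re

st = "Tom and Jerry"

# Runs of lower case letters; the capitals T and J are left out.
lower_runs = re.findall("[a-z]+", st)
# Single upper case letters.
capitals = re.findall("[A-Z]", st)
# There are no digits in st, so the result is an empty list rather than None.
digits = re.findall("[0-9]+", st)

print(lower_runs)
print(capitals)
print(digits)
```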

Doodle

Make up your own examples demonstrating the use of re.findall().

Creating Regular Expressions

Write a regular expression that requires an upper case letter at the beginning of a string followed by three digits and one or more lower case letters. Create a string which matches this pattern and use re.search() to verify it.

Answer

ex = "H123abcde  and more junk"
if re.search("^[A-Z][0-9][0-9][0-9][a-z]+",ex):
    print("Yes")
else:
    print("No") 
## Yes

Exercise

Now change the string so that it does not match and verify the failure.

Answer

ex = "H12abcde  and more junk"
if re.search("^[A-Z][0-9][0-9][0-9][a-z]+",ex):
    print("Yes")
else:
    print("No") 
## No

Exercise

Return to the example which matches and use re.findall() to extract the matching part of the string and print it.

Answer

ex = "H123abcde  and more junk"
l1 = re.findall("^[A-Z][0-9][0-9][0-9][a-z]+",ex)
print(l1) 
## ['H123abcde']
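
The exercise asks for the upper case letter to be at the beginning of the string. The sketch below shows the effect of a leading “^” on that requirement, using a second made-up string in which the code appears after some other text. The variable names are chosen only for this illustration.

```python
import re

pattern_loose = "[A-Z][0-9][0-9][0-9][a-z]+"
pattern_anchored = "^" + pattern_loose   # same pattern, but only at the start

ex_start = "H123abcde  and more junk"
ex_middle = "junk first, then H123abcde"   # made-up string for contrast

results = {
    "loose, start": bool(re.search(pattern_loose, ex_start)),
    "loose, middle": bool(re.search(pattern_loose, ex_middle)),
    "anchored, start": bool(re.search(pattern_anchored, ex_start)),
    "anchored, middle": bool(re.search(pattern_anchored, ex_middle)),
}

for case, found in results.items():
    print(case, "Yes" if found else "No")
```

Without the “^”, the pattern is accepted wherever it occurs in the string. With it, only a string that begins with the code matches.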