Regular Expressions 2

Harold Nelson

10/29/2018

library(reticulate)
reticulate::use_python(python = '/Library/Frameworks/Python.framework/Versions/3.7/bin/python3', required = T)

Exercise

Use a regular expression to find and print all lines in the file “Crime and Punishment.txt” containing the word “CHAPTER” followed by one or more blank spaces followed by one or more upper case letters.

Answer

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    if re.search("CHAPTER +[A-Z]+", line):
        print(line)

## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER VIII

print(lines)

## 21970

Follow-up

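We could relax the requirement to find CHAPTER and also allow lines containing PART to qualify. The character ‘|’ signifies “or.” Because ‘|’ applies to everything on either side of it, the two words are grouped in parentheses so that the blank spaces and upper case letters are required after either one.

Example

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    # (CHAPTER|PART) groups the alternatives; without the parentheses the
    # pattern would mean "CHAPTER" alone, or "PART +[A-Z]+"
    if re.search("(CHAPTER|PART) +[A-Z]+", line):
        print(line)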

## PART I
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## PART II
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## PART III
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## PART IV
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## PART V
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## PART VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER VIII

print(lines)

## 21970
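The ‘|’ operator has lower precedence than putting characters side by side, so where the parentheses go changes what a pattern means. A minimal sketch of the difference; the sample strings are made up for illustration and are not taken from the novel:

```python
import re

# Hypothetical sample lines, chosen only to show the difference.
samples = ["CHAPTER I", "PART II", "CHAPTER", "PARTIAL"]

# Without parentheses: "CHAPTER" alone, OR "PART" + spaces + upper case letters.
ungrouped = "CHAPTER|PART +[A-Z]+"
# With parentheses: either word, THEN spaces + upper case letters.
grouped = "(CHAPTER|PART) +[A-Z]+"

ungrouped_hits = [s for s in samples if re.search(ungrouped, s)]
grouped_hits = [s for s in samples if re.search(grouped, s)]

print(ungrouped_hits)  # ['CHAPTER I', 'PART II', 'CHAPTER']
print(grouped_hits)    # ['CHAPTER I', 'PART II']
```

The bare word "CHAPTER" qualifies only under the ungrouped pattern, because nothing after the ‘|’ is required on that side.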

Exercise

Read the file ‘mbox-short.txt’.

Find and count all lines containing the string ‘.edu’ followed by one or more blank spaces. Note: to use the period as a regular character instead of a wildcard, precede it with a backslash.

import re
hand = open('mbox-short.txt',"r")
lines = 0
for line in hand:
    if re.search(r"\.edu +", line):
        lines += 1
print(lines)

## 315
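To see what the backslash changes, here is a small sketch comparing the escaped and unescaped period. The two sample strings are invented for illustration. The pattern is written as a raw string (the r prefix) so that Python passes the backslash through to the regex engine unchanged; recent Python versions warn about "\." in an ordinary string.

```python
import re

# Made-up samples: the second has an "x" where the period would be.
samples = ["umich.edu ", "umichxedu "]

# Unescaped: "." is a wildcard, so any character before "edu" is accepted.
wildcard_hits = [s for s in samples if re.search(".edu +", s)]
# Escaped: only a literal period before "edu" is accepted.
literal_hits = [s for s in samples if re.search(r"\.edu +", s)]

print(wildcard_hits)  # ['umich.edu ', 'umichxedu ']
print(literal_hits)   # ['umich.edu ']
```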

Exercise

Read the file ‘Crime and Punishment.txt’ and look for a lower case letter followed by a question mark followed by a blank space. Count the lines that match this pattern. Note: to use the question mark as a regular character, precede it with a backslash.

Answer

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    if re.search(r"[a-z]\? ", line):
        lines += 1
print(lines)

## 951
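Left unescaped, the question mark is a quantifier meaning "zero or one of the preceding item", which gives a very different pattern. The sketch below uses two invented sentences to show this, and also shows re.escape from the standard library, which produces the escaped form of a literal string.

```python
import re

# Made-up sample sentences.
samples = ["Is it so? Yes", "No question here"]

# Unescaped: "[a-z]? " means an optional lower case letter, then a space,
# so any line containing a space matches.
optional_hits = [s for s in samples if re.search("[a-z]? ", s)]
# Escaped: a lower case letter, a literal "?", then a space.
literal_hits = [s for s in samples if re.search(r"[a-z]\? ", s)]

# re.escape adds the backslash for us.
escaped = re.escape("?")

print(optional_hits)  # ['Is it so? Yes', 'No question here']
print(literal_hits)   # ['Is it so? Yes']
print(escaped)        # \?
```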

Exercise

Read the file ‘mbox-short.txt’. Look for one or more digits surrounded by blank spaces. Count the lines that match this pattern. Also count the total number of times the pattern occurs.

Answer

import re
hand = open('mbox-short.txt',"r")
lines = 0
patterns = 0
for line in hand:
    matches = re.findall(" [0-9]+ ", line)
    if len(matches) > 0:
        lines += 1
        patterns += len(matches)
print(lines)

## 363

print(patterns)

## 641
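One detail worth knowing when counting occurrences: re.findall returns non-overlapping matches, and the pattern " [0-9]+ " consumes the blank on each side. Two numbers separated by a single blank therefore cannot both be counted. The sketch below uses a made-up string to show this, and one possible alternative using lookbehind and lookahead, which check for the blanks without consuming them. Lookarounds are not covered in these notes, and whether the counts for mbox-short.txt would differ depends on the contents of that file.

```python
import re

# Made-up sample: three numbers separated by single blanks.
text = " 1 2 3 "

# Each match consumes its surrounding blanks, so "2" is skipped.
consuming = re.findall(" [0-9]+ ", text)

# (?<= ) and (?= ) require a blank before and after without consuming it.
lookaround = re.findall("(?<= )[0-9]+(?= )", text)

print(consuming)   # [' 1 ', ' 3 ']
print(lookaround)  # ['1', '2', '3']
```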

More Sets of Characters

Regular expressions use a backslash followed by a letter to denote several useful sets of characters.

\s means any white-space character (blank, tab, newline, and so on).
\S means any non-white-space character.
\d means any digit.
\D means any non-digit.
\w means any word character (a letter, a digit, or an underscore).
\W means any non-word character.

https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference is an only slightly overwhelming reference.
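A short sketch of these sets in action with Python's re module. The sample text is made up, and the patterns are raw strings so the backslashes reach the regex engine unchanged.

```python
import re

# Made-up sample containing blanks, a tab, digits, and an underscore.
text = "Room 42, floor_3\tbuilding B"

digits = re.findall(r"\d", text)           # single digits
words = re.findall(r"\w+", text)           # runs of word characters
non_space_runs = re.findall(r"\S+", text)  # runs of non-white-space
whitespace = re.findall(r"\s", text)       # single white-space characters

print(digits)          # ['4', '2', '3']
print(words)           # ['Room', '42', 'floor_3', 'building', 'B']
print(non_space_runs)  # ['Room', '42,', 'floor_3', 'building', 'B']
print(whitespace)      # [' ', ' ', '\t', ' ']
```

Note that the comma stays attached to "42," under \S+ but is dropped under \w+, since a comma is neither white space nor a word character.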

Required Numbers

We can use {} to indicate a required number of characters.

How do you say “exactly three lower case letters”?
- [a-z][a-z][a-z] is one way.
- [a-z]{3} is another way.
At least 3, but no more than 5 is written {3,5}.
At least 3 is written {3,}.
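The three forms can be compared side by side. This sketch assumes we want the whole word to fit the pattern, so it uses re.fullmatch (which requires the entire string to match) on each word of a made-up list.

```python
import re

# Made-up words of length 2, 3, 4, and 6.
words = "ab abc abcd abcdef".split()

exactly3 = [w for w in words if re.fullmatch("[a-z]{3}", w)]
three_to_five = [w for w in words if re.fullmatch("[a-z]{3,5}", w)]
at_least3 = [w for w in words if re.fullmatch("[a-z]{3,}", w)]

print(exactly3)       # ['abc']
print(three_to_five)  # ['abc', 'abcd']
print(at_least3)      # ['abc', 'abcd', 'abcdef']
```

With re.search instead of re.fullmatch, "[a-z]{3}" would also succeed on "abcdef", because three lower case letters can be found inside it.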

Exercise

How would you write a regex to identify a social security number?

Answer

import re
regex = "\\d{3}-\\d{2}-\\d{4}"
re.findall(regex,"My ssn is 999-99-9999")

## ['999-99-9999']

re.findall(regex,"My ssn is 999-99-999")

## []
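This pattern finds the 3-2-4 digit shape anywhere, including inside a longer run of digits. If that matters, one possible refinement is to add \b (a word boundary) at each end. \b is not covered in these notes, and it must be written in a raw string because "\b" in an ordinary Python string means a backspace character. The test strings below are made up.

```python
import re

loose = r"\d{3}-\d{2}-\d{4}"
bounded = r"\b\d{3}-\d{2}-\d{4}\b"   # same shape, but not inside a longer number

too_long = "id 1999-99-99999"        # extra digit at each end
good = "My ssn is 999-99-9999"

loose_on_long = re.findall(loose, too_long)
bounded_on_long = re.findall(bounded, too_long)
bounded_on_good = re.findall(bounded, good)

print(loose_on_long)    # ['999-99-9999']
print(bounded_on_long)  # []
print(bounded_on_good)  # ['999-99-9999']
```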

More sets

Instead of just [a-z] or [A-Z] we can list specific sets such as [aeiou] or [AEIOU].

Example

import re
x = re.findall("^[AEIOU].+", "cat")
print(x)

## []

x = re.findall("^[AEIOU].+", "B cat")
print(x)

## []

x = re.findall("^[AEIOU].+", "A cat")
print(x)

## ['A cat']

Greedy and Non-Greedy

import re
heading  = '<h1>TITLE</h1>'
resg = re.findall('<.*>',heading)
print(resg)

## ['<h1>TITLE</h1>']

resng = re.findall('<.*?>',heading)
print(resng)

## ['<h1>', '</h1>']
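The quantifiers * and + are greedy: they match as much text as they can while still letting the rest of the pattern succeed. Adding ? after the quantifier makes it non-greedy, so it matches as little as it can. A second sketch of the same idea on a made-up sentence with quoted words; the last pattern uses a negated set, [^"], which is not covered in these notes, as another way to stop at the first closing quote.

```python
import re

# Made-up sample with two quoted words.
text = 'say "hi" and "bye"'

greedy = re.findall('".*"', text)      # runs to the LAST quote
lazy = re.findall('".*?"', text)       # stops at the NEXT quote
negated = re.findall('"[^"]*"', text)  # anything except a quote in between

print(greedy)   # ['"hi" and "bye"']
print(lazy)     # ['"hi"', '"bye"']
print(negated)  # ['"hi"', '"bye"']
```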

Searching for Phone Numbers

Phone numbers are written in several different patterns:

- Ten digits: 3609569999
- Area code in parentheses, then a dash: (360)956-9999
- Two dashes: 360-956-9999

Write a pattern for each of these and test them.

Answer

import re
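# Raw strings (r"...") pass the backslashes to the regex engine unchanged
ph1 = r"\d{10}"
# Ten digits in a row

ph2 = r"\(\d{3}\)\d{3}-\d{4}"
# Escape the parentheses so they are treated as regular characters

ph3 = r"\d{3}-\d{3}-\d{4}"
# Three digits, a dash, three digits, a dash, four digits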

phnos = "6556 5152349999 (515)234-9999 515-234-9999 360-1245"

regex = ph1 + "|" + ph2 + "|"  + ph3
 
res = re.findall(regex,phnos)
print(res)

## ['5152349999', '(515)234-9999', '515-234-9999']
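To test each pattern on its own, one option is to check individual candidate strings with re.fullmatch, which succeeds only if the whole string fits. The sketch below is one way to do that; the classify helper, the pattern names, and the candidate list are hypothetical and introduced only for illustration.

```python
import re

# The three patterns from above, keyed by a descriptive (made-up) name.
patterns = {
    "ten digits": r"\d{10}",
    "parentheses": r"\(\d{3}\)\d{3}-\d{4}",
    "two dashes": r"\d{3}-\d{3}-\d{4}",
}

def classify(number):
    """Return the name of the first pattern matching the whole string, else None."""
    for name, pattern in patterns.items():
        if re.fullmatch(pattern, number):
            return name
    return None

# Made-up candidates; the last has 11 digits and should be rejected.
candidates = ["3609569999", "(360)956-9999", "360-956-9999", "360-1245", "36095699990"]
results = {c: classify(c) for c in candidates}

for number, kind in results.items():
    print(number, "->", kind)

# By contrast, findall locates ten digits inside the 11-digit string.
inside_longer = re.findall(patterns["ten digits"], "36095699990")
print(inside_longer)  # ['3609569999']
```

The last two lines show why the choice between searching and full matching matters: the 11-digit string is rejected as a whole, but a ten-digit match can still be found inside it.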