Regular Expressions 2

Harold Nelson

4/17/2018

library(reticulate)

Exercise

Use a regular expression to find and print all lines in the file “Crime and Punishment.txt” containing the word “CHAPTER” followed by one or more blank spaces followed by 1 or more upper case letters.

Answer

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    if re.search("CHAPTER +[A-Z]+", line):
        print(line)
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER VIII
print(lines)
## 21970

Follow-up

We could relax the requirement to find CHAPTER and also allow lines containing PART to qualify. The character ‘|’ signifies “or.”

Example

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    # Parentheses make " +[A-Z]+" apply to both alternatives
    if re.search("(CHAPTER|PART) +[A-Z]+", line):
        print(line)
## PART I
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## PART II
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## PART III
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## PART IV
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## PART V
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## PART VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER VIII
print(lines)
## 21970
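
One detail of ‘|’ is worth checking: alternation binds more loosely than anything else in a pattern, so without parentheses a trailing " +[A-Z]+" belongs only to the last alternative. The sketch below compares the two forms on a few made-up lines (the sample strings are assumptions for illustration, not text from the novel).

```python
import re

# Hypothetical test lines, not taken from the novel
samples = ["CHAPTER I", "PART II", "THE LAST CHAPTER", "PART"]

# Read as: CHAPTER   or   PART +[A-Z]+
ungrouped = "CHAPTER|PART +[A-Z]+"
# Read as: (CHAPTER or PART), then blanks, then upper case letters
grouped = "(CHAPTER|PART) +[A-Z]+"

ungrouped_hits = [s for s in samples if re.search(ungrouped, s)]
grouped_hits = [s for s in samples if re.search(grouped, s)]

print(ungrouped_hits)
## ['CHAPTER I', 'PART II', 'THE LAST CHAPTER']
print(grouped_hits)
## ['CHAPTER I', 'PART II']
```

The ungrouped pattern accepts any line containing CHAPTER, even when no blanks or letters follow it.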

Exercise

Read the file ‘mbox-short.txt’.

Find and count all lines containing the string ‘.edu’ followed by one or more blank spaces. Note: to treat the period as a literal character instead of a wildcard, precede it with a backslash.

import re
hand = open('mbox-short.txt',"r")
lines = 0
for line in hand:
    # Raw string (r"...") passes the backslash to re unchanged
    if re.search(r"\.edu +", line):
        lines += 1
print(lines)
## 315
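
To see what the backslash changes, the following sketch runs both versions of the pattern on two made-up lines (the sample strings are assumptions, not lines from mbox-short.txt). The `r` prefix makes a raw string, so Python hands the backslash to `re` unchanged.

```python
import re

# Hypothetical lines for illustration
samples = ["someone@example.edu Sat Jan 5", "an xedu token"]

# Unescaped period: any single character, then "edu", then blanks
wildcard_hits = [s for s in samples if re.search(r".edu +", s)]
# Escaped period: a literal ".", then "edu", then blanks
literal_hits = [s for s in samples if re.search(r"\.edu +", s)]

print(wildcard_hits)
## ['someone@example.edu Sat Jan 5', 'an xedu token']
print(literal_hits)
## ['someone@example.edu Sat Jan 5']
```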

Exercise

Read the file ‘Crime and Punishment.txt’ and look for a lower case letter followed by a question mark followed by a blank space. Count the lines that match this pattern. Note: to treat the question mark as a literal character, precede it with a backslash.

Answer

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    # Raw string (r"...") passes the backslash to re unchanged
    if re.search(r"[a-z]\? ", line):
        lines += 1
print(lines)
## 951
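
Without the backslash, the question mark is a quantifier meaning "zero or one of the preceding item", so the pattern would match nearly every line that contains a blank. A minimal sketch on two made-up strings (assumptions for illustration, not text from the novel):

```python
import re

statement = "no question here"
question = "Why? Because."

# Unescaped: an optional lower case letter, then a blank
unescaped_on_statement = re.search("[a-z]? ", statement) is not None
# Escaped: a lower case letter, a literal "?", then a blank
escaped_on_statement = re.search(r"[a-z]\? ", statement) is not None
escaped_on_question = re.search(r"[a-z]\? ", question) is not None

print(unescaped_on_statement)
## True
print(escaped_on_statement)
## False
print(escaped_on_question)
## True
```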

Exercise

Read the files ‘Crime and Punishment.txt’, ‘mbox-short.txt’, and ‘Pride and Prejudice.txt’. Look for one or more digits surrounded by blank spaces. Count the lines that match this pattern. Also count the total number of times the pattern occurs.

Answer

import re
# Shown for mbox-short.txt; change the file name to process the other two files
hand = open('mbox-short.txt',"r")
lines = 0
patterns = 0
for line in hand:
    matches = re.findall(" [0-9]+ ", line)
    if len(matches) > 0:
        lines += 1
        patterns += len(matches)
print(lines)
## 363
print(patterns)
## 641
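
The answer above reads only ‘mbox-short.txt’. One way to cover all three files is to wrap the counting loop in a function, as sketched below. The names `count_digit_runs` and `count_in_files` are hypothetical helpers introduced here, and the sample lines are made up. The sketch also illustrates a caveat: `findall` returns non-overlapping matches, and each match of " [0-9]+ " consumes the blank after the digits, so in " 1 2 3 " the 2 is skipped. A lookaround pattern, which checks for the blanks without consuming them, is one possible alternative.

```python
import re

def count_digit_runs(lines, pattern=" [0-9]+ "):
    """Return (matching_lines, total_matches) for an iterable of text lines."""
    matching_lines = 0
    total_matches = 0
    for line in lines:
        matches = re.findall(pattern, line)
        if matches:
            matching_lines += 1
            total_matches += len(matches)
    return matching_lines, total_matches

def count_in_files(filenames):
    """Apply count_digit_runs to each named file; returns {name: (lines, matches)}."""
    results = {}
    for name in filenames:
        with open(name, "r") as hand:
            results[name] = count_digit_runs(hand)
    return results

# In-memory stand-in for a file (hypothetical data)
sample = ["Jan 5 09 2008", "no digits here", " 1 2 3 "]

consuming = count_digit_runs(sample)
lookaround = count_digit_runs(sample, r"(?<= )[0-9]+(?= )")
print(consuming)
## (2, 3)
print(lookaround)
## (2, 5)
```

With the three text files in the working directory, `count_in_files(['Crime and Punishment.txt', 'mbox-short.txt', 'Pride and Prejudice.txt'])` would return one pair of counts per file.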

More Sets of Characters

Regular expressions use a backslash followed by a letter to denote several useful sets of characters.

https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference is an only slightly overwhelming reference.
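
As a quick illustration in Python's `re` module, the sketch below tries three of the most common sets: `\d` (digits), `\w` (letters, digits, and the underscore), and `\s` (whitespace). The sample text is made up for illustration.

```python
import re

text = "Room 101, floor 3"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits, or underscores
spaces = re.findall(r"\s", text)    # individual whitespace characters

print(digits)
## ['101', '3']
print(words)
## ['Room', '101', 'floor', '3']
print(len(spaces))
## 3
```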

More sets

Instead of just [a-z] or [A-Z] we can list specific sets such as [aeiou] or [AEIOU].

Example

import re
x = re.findall("^[AEIOU].+", "cat")
print(x)
## []
x = re.findall("^[AEIOU].+", "B cat")
print(x)
## []
x = re.findall("^[AEIOU].+", "A cat")
print(x)
## ['A cat']
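
A set such as [aeiou] matches exactly one character from the list, so with `findall` it can also be used to pick out every vowel in a string. A small sketch, with a sample phrase chosen only for illustration:

```python
import re

phrase = "regular expression"

# Each match is a single character taken from the set
vowels = re.findall("[aeiou]", phrase)

print(vowels)
## ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(len(vowels))
## 7
```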

Greedy and Non-Greedy

import re
heading  = '<h1>TITLE</h1>'
resg = re.findall('<.*>',heading)
print(resg)
## ['<h1>TITLE</h1>']
resng = re.findall('<.*?>',heading)
print(resng)
## ['<h1>', '</h1>']
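
In the example above, the greedy `.*` takes as many characters as it can while still allowing a match, so it runs to the last ‘>’. The non-greedy `.*?` takes as few as it can, so it stops at the first ‘>’. The same behavior can be seen with quotation marks in place of angle brackets, as in this sketch (the sentence is made up for illustration):

```python
import re

sentence = 'She said "yes" and then "no".'

# Greedy: from the first quotation mark to the last one
greedy = re.findall('".*"', sentence)
# Non-greedy: each quoted piece separately
lazy = re.findall('".*?"', sentence)

print(greedy)
## ['"yes" and then "no"']
print(lazy)
## ['"yes"', '"no"']
```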

Searching for Phone Numbers

Phone numbers appear in different patterns, for example:

5152349999

(515)234-9999

515-234-9999

Write patterns for each of these and test them.

Answer

import re
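# Raw strings (r"...") pass the backslashes to re unchanged
ph1 = r"\d{10}"
# Ten consecutive digits
ph2 = r"\(\d{3}\)\d{3}-\d{4}"
# Treat parentheses as regular characters
ph3 = r"\d{3}-\d{3}-\d{4}"
# Three groups of digits separated by hyphens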
phnos = "xxx 5152349999 (515)234-9999 515-234-9999"
res = re.findall(ph1+"|"+ph2+"|"+ph3,phnos)
print(res)
## ['5152349999', '(515)234-9999', '515-234-9999']
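
One limitation of the ten-digit pattern is that it also matches the first ten digits of a longer run of digits. A possible refinement, sketched below, adds the word-boundary marker `\b` on both sides and then strips the punctuation so that all three formats compare equal. The test string and the variable names are assumptions introduced for illustration.

```python
import re

ph1 = r"\b\d{10}\b"            # ten digits that are not part of a longer run
ph2 = r"\(\d{3}\)\d{3}-\d{4}"  # (515)234-9999
ph3 = r"\d{3}-\d{3}-\d{4}"     # 515-234-9999
pattern = re.compile("|".join([ph1, ph2, ph3]))

text = "id 12345678901234 call 5152349999 or (515)234-9999 or 515-234-9999"

# Without \b, the first ten digits of the 14-digit id are reported
loose = re.findall(r"\d{10}", text)
# With \b, the id is skipped
found = pattern.findall(text)
# Remove every non-digit character to normalize the formats
normalized = [re.sub(r"\D", "", number) for number in found]

print(loose)
## ['1234567890', '5152349999']
print(found)
## ['5152349999', '(515)234-9999', '515-234-9999']
print(normalized)
## ['5152349999', '5152349999', '5152349999']
```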