Regular Expressions 2

Answer

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    if re.search("CHAPTER +[A-Z]+", line):
        print(line)

## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER VIII

print(lines)

## 21970

Follow-up

We could relax the requirement to find CHAPTER and also allow lines containing PART to qualify. The character ‘|’ signifies “or.”

Example

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    # Parentheses group the alternatives so that " +[A-Z]+" applies to both words
    if re.search("(CHAPTER|PART) +[A-Z]+", line):
        print(line)

## PART I
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## PART II
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## PART III
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## PART IV
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## PART V
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## PART VI
## 
## CHAPTER I
## 
## CHAPTER II
## 
## CHAPTER III
## 
## CHAPTER IV
## 
## CHAPTER V
## 
## CHAPTER VI
## 
## CHAPTER VII
## 
## CHAPTER VIII

print(lines)

## 21970
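
Because ‘|’ has the lowest precedence of the regular-expression operators, it matters where the alternatives begin and end. The sketch below compares an ungrouped alternation with a grouped one. The sample lines are made up for illustration and are not taken from the novel.

```python
import re

# Made-up sample lines, for illustration only
samples = ["PART I", "CHAPTER I", "the next CHAPTER", "PARTLY true"]

ungrouped = "CHAPTER|PART +[A-Z]+"    # means: "CHAPTER"  or  "PART +[A-Z]+"
grouped = "(CHAPTER|PART) +[A-Z]+"    # either word, then spaces and capitals

ungrouped_hits = [s for s in samples if re.search(ungrouped, s)]
grouped_hits = [s for s in samples if re.search(grouped, s)]
print(ungrouped_hits)
print(grouped_hits)
```

The ungrouped pattern accepts any line that merely contains CHAPTER, while the grouped pattern also requires the spaces and capital letters after either word.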

Exercise

Read the file ‘mbox-short.txt’.

Find and count all lines containing the string ‘.edu’ followed by one or more blank spaces. Note: to use the period as a regular character instead of a wildcard, precede it with a backslash.

Answer

import re
hand = open('mbox-short.txt',"r")
lines = 0
for line in hand:
    # The raw string (r"...") passes the backslash through to the regex unchanged
    if re.search(r"\.edu +", line):
        lines += 1
print(lines)

## 315
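
To see what the backslash changes, the following sketch runs the pattern with and without it on three made-up strings (they are not lines from the mbox file).

```python
import re

# Made-up sample strings, for illustration only
samples = ["user@uw.edu 2008", "see uw-edu now", "uw.edu"]

wildcard = ".edu +"     # "." matches any single character
literal = r"\.edu +"    # "\." matches only a period

wildcard_hits = [s for s in samples if re.search(wildcard, s)]
literal_hits = [s for s in samples if re.search(literal, s)]
print(wildcard_hits)
print(literal_hits)
```

The unescaped version also accepts "-edu ", because the wildcard is satisfied by the dash. The last string matches neither pattern, since no blank space follows "edu".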

Exercise

Read the file ‘Crime and Punishment.txt’ and look for a lower case letter followed by a question mark followed by a blank space. Count the lines that match this pattern. Note: to use the question mark as a regular character, precede it with a backslash.

Answer

import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    # The raw string (r"...") passes the backslash through to the regex unchanged
    if re.search(r"[a-z]\? ", line):
        lines += 1
print(lines)

## 951

Exercise

Read the files ‘Crime and Punishment.txt’, ‘mbox-short.txt’, and ‘Pride and Prejudice.txt’. Look for one or more digits surrounded by blank spaces. Count the lines that match this pattern. Also count the total number of times the pattern occurs.

Answer (shown for ‘mbox-short.txt’; the other files are handled the same way)

import re
hand = open('mbox-short.txt',"r")
lines = 0
patterns = 0
for line in hand:
    matches = re.findall(" [0-9]+ ", line)
    if len(matches) > 0:
        lines += 1
        patterns += len(matches)
print(lines)

## 363

print(patterns)

## 641
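
Since the exercise names three files, one possible way to reuse the counting logic is to wrap it in a function that accepts any iterable of lines. This is only a sketch: the function name count_digit_runs and the demo lines are hypothetical and do not come from the files.

```python
import re

def count_digit_runs(lines, pattern=r" [0-9]+ "):
    """Return (matching_lines, total_matches) for an iterable of lines."""
    matching_lines = 0
    total_matches = 0
    for line in lines:
        found = re.findall(pattern, line)
        if found:
            matching_lines += 1
            total_matches += len(found)
    return matching_lines, total_matches

# Made-up lines standing in for a file's contents
demo = ["no digits here\n", "call 911 now\n", "sum 1 2 3 done\n"]
default_counts = count_digit_runs(demo)
lookaround_counts = count_digit_runs(demo, r"(?<= )[0-9]+(?= )")
print(default_counts)
print(lookaround_counts)
```

With a real file, the open file object can be passed directly, for example count_digit_runs(open('mbox-short.txt', "r")). The second call illustrates a subtlety: findall returns non-overlapping matches, so in "1 2 3" the space after the 1 is consumed by the first match and the 2 is skipped. The lookaround version checks for the surrounding spaces without consuming them, so it counts all three numbers. Which count is wanted depends on how the exercise is read.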

More Sets of Characters

Regular expressions use a backslash followed by a letter to denote several useful sets of characters.

\s means any white-space character.
\S means any non-white-space character.
\d means any digit.
\D means any non-digit.
\w means any word character.
\W means any non-word character.
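
As a quick illustration, the sketch below applies each of the six backslash classes (\s, \S, \d, \D, \w, \W) to one made-up string. A plus sign is added in some patterns so that runs of characters are returned together.

```python
import re

text = "Room 42, Floor 7"   # made-up sample string

digits = re.findall(r"\d", text)         # single digits
non_digits = re.findall(r"\D+", text)    # runs of non-digits
spaces = re.findall(r"\s", text)         # single white-space characters
non_spaces = re.findall(r"\S+", text)    # runs of non-white-space
words = re.findall(r"\w+", text)         # runs of word characters
non_words = re.findall(r"\W+", text)     # runs of non-word characters

for result in (digits, non_digits, spaces, non_spaces, words, non_words):
    print(result)
```

Note that the comma counts as non-white-space but not as a word character, so "42," appears in the \S+ result while \w+ returns "42".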

https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference is an only slightly overwhelming reference.

More Sets

Instead of just [a-z] or [A-Z] we can list specific sets such as [aeiou] or [AEIOU].

Example

import re
x = re.findall("^[AEIOU].+", "cat")
print(x)

## []

x = re.findall("^[AEIOU].+", "B cat")
print(x)

## []

x = re.findall("^[AEIOU].+", "A cat")
print(x)

## ['A cat']

Greedy and Non-Greedy

import re
heading  = '<h1>TITLE</h1>'
resg = re.findall('<.*>',heading)
print(resg)

## ['<h1>TITLE</h1>']

resng = re.findall('<.*?>',heading)
print(resng)

## ['<h1>', '</h1>']
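
In the example above, '.*' is greedy: it takes as many characters as it can while still allowing the rest of the pattern to match. Adding a question mark ('.*?') makes it non-greedy, so it takes as few as possible. The same contrast on a made-up string with quotation marks instead of tags:

```python
import re

sentence = 'say "yes" or "no"'   # made-up sample string

greedy = re.findall('".*"', sentence)    # as many characters as possible
lazy = re.findall('".*?"', sentence)     # as few characters as possible
print(greedy)
print(lazy)
```

The greedy pattern runs from the first quotation mark to the last one, while the non-greedy pattern stops at the nearest closing mark and so finds each quoted word separately.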

Searching for Phone Numbers

Phone numbers are written in different patterns:

Ten digits: 3609569999
Area code in parentheses, then a dash: (360)956-9999
Two-dash pattern: 360-956-9999

Write patterns for each of these and test them.

Answer

import re
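# Raw strings (r"...") pass the backslashes through to the regex unchanged
ph1 = r"\d{10}"
# Exactly 10 digits
ph2 = r"\(\d{3}\)\d{3}-\d{4}"
# Treat parentheses as regular characters
ph3 = r"\d{3}-\d{3}-\d{4}"
# Three digits, dash, three digits, dash, four digits
phnos = "xxx 5152349999 (515)234-9999 515-234-9999"
res = re.findall(ph1 + "|" + ph2 + "|" + ph3, phnos)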
print(res)

## ['5152349999', '(515)234-9999', '515-234-9999']
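
The answer tests the three patterns combined. One possible way to test each pattern on its own is re.fullmatch, which succeeds only when the entire string fits the pattern. In this sketch the dictionary labels and the last candidate (a deliberately malformed number) are made up for illustration.

```python
import re

# The three patterns from the answer, written as raw strings
patterns = {
    "ten digits": r"\d{10}",
    "parentheses": r"\(\d{3}\)\d{3}-\d{4}",
    "two dashes": r"\d{3}-\d{3}-\d{4}",
}
candidates = ["3609569999", "(360)956-9999", "360-956-9999", "360-9569999"]

results = {}
for text in candidates:
    # Record the label of every pattern that the whole string fits
    results[text] = [name for name, pat in patterns.items()
                     if re.fullmatch(pat, text)]
    print(text, results[text])
```

Each well-formed number fits exactly one pattern, and the malformed one fits none. With re.findall or re.search, by contrast, \d{10} would also match ten digits inside a longer run of digits.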