Harold Nelson
10/29/2018
Use a regular expression to find and print all lines in the file “Crime and Punishment.txt” containing the word “CHAPTER” followed by one or more blank spaces followed by 1 or more upper case letters.
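import re
hand = open('Crime and Punishment.txt', "r")
lines = 0
for line in hand:
    lines += 1
    # CHAPTER, one or more spaces, then one or more upper-case letters
    if re.search(r"CHAPTER +[A-Z]+", line):
        print(line)
# Total number of lines read
print(lines)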
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER VIII
## 21970
We could relax the requirement to find CHAPTER and also allow lines containing PART to qualify. The character ‘|’ signifies “or.”
Example
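import re
hand = open('Crime and Punishment.txt', "r")
lines = 0
for line in hand:
    lines += 1
    # The parentheses make "+[A-Z]+" apply to both CHAPTER and PART
    if re.search(r"(CHAPTER|PART) +[A-Z]+", line):
        print(line)
# Total number of lines read
print(lines)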
## PART I
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## PART II
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## PART III
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## PART IV
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## PART V
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## PART VI
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER VIII
## 21970
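A note on how ‘|’ groups: it has the lowest precedence of the regex operators, so a pattern written as CHAPTER|PART +[A-Z]+ is read as “CHAPTER” or “PART +[A-Z]+”. The sketch below compares an ungrouped and a grouped pattern. The sample lines are invented for illustration and are not taken from the novel.

```python
import re

# Hypothetical sample lines, invented for illustration
samples = ["PART I", "CHAPTER II", "THE LAST CHAPTER", "a part of it"]

# Ungrouped: matches "CHAPTER" anywhere, or "PART" + spaces + capitals
ungrouped = r"CHAPTER|PART +[A-Z]+"
# Grouped: either word must be followed by spaces and capitals
grouped = r"(?:CHAPTER|PART) +[A-Z]+"

ungrouped_hits = [s for s in samples if re.search(ungrouped, s)]
grouped_hits = [s for s in samples if re.search(grouped, s)]

print(ungrouped_hits)  # ['PART I', 'CHAPTER II', 'THE LAST CHAPTER']
print(grouped_hits)    # ['PART I', 'CHAPTER II']
```

The (?: ... ) form groups without capturing, which keeps re.findall returning whole matches if the pattern is reused there.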
Read the file ‘mbox-short.txt’.
Find and count all lines containing the string ‘.edu’ followed by one or more blank spaces. Note: to use the period as a regular character instead of a wildcard, precede it with a backslash.
import re
hand = open('mbox-short.txt', "r")
lines = 0
for line in hand:
    # Raw string: \. matches a literal period
    if re.search(r"\.edu +", line):
        lines += 1
print(lines)
## 315
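To see why the backslash matters, here is a small sketch comparing the escaped and unescaped period. Both sample strings are made up for illustration and do not come from mbox-short.txt.

```python
import re

# Hypothetical sample strings, invented for illustration
real_address = "From user@school.edu Sat Jan 5"
look_alike = "school-edu today"

# Escaped period: only a literal "." before "edu" qualifies
escaped_real = re.search(r"\.edu +", real_address) is not None
escaped_fake = re.search(r"\.edu +", look_alike) is not None

# Unescaped period: any character before "edu" qualifies, including "-"
wildcard_fake = re.search(r".edu +", look_alike) is not None

print(escaped_real, escaped_fake, wildcard_fake)  # True False True
```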
Read the file ‘Crime and Punishment.txt’ and look for a lower case letter followed by a question mark followed by a blank space. Count the lines that match this pattern. Note: to use the question mark as a regular character, precede it with a backslash.
import re
hand = open('Crime and Punishment.txt', "r")
lines = 0
for line in hand:
    # Raw string: \? matches a literal question mark
    if re.search(r"[a-z]\? ", line):
        lines += 1
print(lines)
## 951
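Without the backslash, ‘?’ is a quantifier meaning “zero or one of the preceding item,” so the pattern changes meaning rather than failing. A minimal sketch, using two invented sample lines:

```python
import re

# Hypothetical sample lines, invented for illustration
question = "Is it so? Yes."
statement = "no question here"

# Escaped: a lower-case letter, a literal "?", then a space
literal_in_question = re.search(r"[a-z]\? ", question)
literal_in_statement = re.search(r"[a-z]\? ", statement)

# Unescaped: an optional lower-case letter, then a space
optional_in_statement = re.search(r"[a-z]? ", statement)

print(literal_in_question.group())    # 'o? '
print(literal_in_statement)           # None
print(optional_in_statement.group())  # 'o '
```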
Read the file ‘mbox-short.txt’. Look for one or more digits surrounded by blank spaces. Count the lines that match this pattern. Also count the total number of times the pattern occurs.
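import re
hand = open('mbox-short.txt', "r")
lines = 0
patterns = 0
for line in hand:
    # findall returns a list of every non-overlapping match in the line
    matches = re.findall(r" [0-9]+ ", line)
    if len(matches) > 0:
        lines += 1
        patterns += len(matches)
# Number of matching lines, then total number of matches
print(lines)
print(patterns)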
## 363
## 641
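One thing to keep in mind when counting occurrences: re.findall returns non-overlapping matches, and the pattern above consumes the blank on each side of the number. When two numbers share a single blank, the second one can be skipped. The sketch below shows this on an invented string and uses lookbehind and lookahead assertions as one possible alternative. Whether this affects the counts for mbox-short.txt is not checked here.

```python
import re

# Hypothetical sample string, invented for illustration
text = " 1 2 3 "

# Each match consumes its surrounding blanks, so "2" is skipped
consuming = re.findall(r" [0-9]+ ", text)

# Lookarounds check for the blanks without consuming them
lookaround = re.findall(r"(?<= )[0-9]+(?= )", text)

print(consuming)   # [' 1 ', ' 3 ']
print(lookaround)  # ['1', '2', '3']
```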
Regular expressions use a backslash followed by a letter to denote several useful sets of characters.
\s means any white-space character.
\S means any non-white-space character.
\d means any digit.
\D means any non-digit.
\w means any word character.
\W means any non-word character.
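A short sketch of these sets in action. The sample string is invented for illustration.

```python
import re

# Hypothetical sample string, invented for illustration
text = "Room 42, floor 3"

digits = re.findall(r"\d+", text)      # runs of digits
non_digits = re.findall(r"\D+", text)  # runs of non-digits
words = re.findall(r"\w+", text)       # runs of word characters
non_words = re.findall(r"\W+", text)   # runs of non-word characters
spaces = re.findall(r"\s", text)       # single white-space characters
non_spaces = re.findall(r"\S+", text)  # runs of non-white-space characters

print(digits)      # ['42', '3']
print(non_digits)  # ['Room ', ', floor ']
print(words)       # ['Room', '42', 'floor', '3']
print(non_words)   # [' ', ', ', ' ']
print(spaces)      # [' ', ' ', ' ']
print(non_spaces)  # ['Room', '42,', 'floor', '3']
```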
https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference is an only slightly overwhelming reference.
We can use {} to indicate a required number of characters.
How would you write a regex to identify a social security number?
## ['999-99-9999']
## []
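One possible answer, offered as a sketch rather than the only solution: use {n} after \d to require exactly n digits in each group. The two test strings below are invented for illustration.

```python
import re

# Three digits, a dash, two digits, a dash, four digits
ssn = r"\d{3}-\d{2}-\d{4}"

# Hypothetical test strings, invented for illustration
good = "My number is 999-99-9999"
bad = "999-999-999"

good_result = re.findall(ssn, good)
bad_result = re.findall(ssn, bad)

print(good_result)  # ['999-99-9999']
print(bad_result)   # []
```

This pattern only checks the shape of the string. It would also match inside a longer run of digits and dashes, so stricter uses may want to add boundaries such as \b on each side.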
Instead of just [a-z] or [A-Z] we can list specific sets such as [aeiou] or [AEIOU].
Example
## []
## []
## ['A cat']
## ['<h1>TITLE</h1>']
## ['<h1>', '</h1>']
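As a separate illustration of listing specific characters in square brackets, here is a minimal sketch with the vowel sets. The sample string is invented, and this is not the code that produced the output shown above.

```python
import re

# Hypothetical sample string, invented for illustration
text = "Regular Expressions"

lower_vowels = re.findall(r"[aeiou]", text)    # any one lower-case vowel
upper_vowels = re.findall(r"[AEIOU]", text)    # any one upper-case vowel
vowel_pairs = re.findall(r"[aeiou]{2}", text)  # two lower-case vowels in a row

print(lower_vowels)  # ['e', 'u', 'a', 'e', 'i', 'o']
print(upper_vowels)  # ['E']
print(vowel_pairs)   # ['io']
```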
Phone numbers come in different patterns.
10 digits: 3609569999
Area code in parentheses, then a dash: (360)956-9999
Two dashes: 360-956-9999
Write patterns for each of these and test them.
import re
# 10 digits in a row
ph1 = r"\d{10}"
# Backslashes make the parentheses regular characters
ph2 = r"\(\d{3}\)\d{3}-\d{4}"
# Two-dash pattern
ph3 = r"\d{3}-\d{3}-\d{4}"
phnos = "6556 5152349999 (515)234-9999 515-234-9999 360-1245"
# Join the three patterns with "|" so any one of them may match
regex = ph1 + "|" + ph2 + "|" + ph3
res = re.findall(regex, phnos)
print(res)
## ['5152349999', '(515)234-9999', '515-234-9999']