Harold Nelson
4/17/2018
library(reticulate)
Use a regular expression to find and print all lines in the file “Crime and Punishment.txt” that contain the word “CHAPTER” followed by one or more blank spaces and then one or more uppercase letters.
import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    if re.search("CHAPTER +[A-Z]+", line):
        print(line)
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER VIII
print(lines)
## 21970
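The blank lines in the output above appear because each line read from the file already ends with a newline and print adds a second one. A minimal sketch, assuming a small made-up list of lines in place of the file (the names sample_lines and headings are illustrative only), shows how stripping the trailing newline tidies the output:

```python
import re

# Hypothetical sample lines standing in for the contents of the file.
sample_lines = ["PART I\n", "CHAPTER I\n", "An ordinary line of text\n", "CHAPTER II\n"]

headings = []
for line in sample_lines:
    if re.search(r"CHAPTER +[A-Z]+", line):
        # rstrip() removes the trailing newline so print does not double it
        headings.append(line.rstrip())

for heading in headings:
    print(heading)
```

The same rstrip() call could be applied to each line read from the real file.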
We could relax the requirement to find CHAPTER and also allow lines containing PART to qualify. The character ‘|’ signifies “or.”
Example
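import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    lines += 1
    # Group the alternatives; without the parentheses, "|" would split the
    # whole pattern into "CHAPTER" or "PART +[A-Z]+".
    if re.search("(CHAPTER|PART) +[A-Z]+", line):
        print(line)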
## PART I
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## PART II
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## PART III
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## PART IV
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## PART V
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## PART VI
##
## CHAPTER I
##
## CHAPTER II
##
## CHAPTER III
##
## CHAPTER IV
##
## CHAPTER V
##
## CHAPTER VI
##
## CHAPTER VII
##
## CHAPTER VIII
print(lines)
## 21970
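Because ‘|’ has the lowest precedence of the regular expression operators, a pattern written as CHAPTER|PART +[A-Z]+ is read as “CHAPTER” or “PART +[A-Z]+”. A minimal sketch, assuming a made-up line of text, shows the difference that grouping the alternatives makes:

```python
import re

loose = r"CHAPTER|PART +[A-Z]+"        # "CHAPTER" alone, OR "PART +[A-Z]+"
grouped = r"(?:CHAPTER|PART) +[A-Z]+"  # either word, THEN spaces and letters

# Hypothetical line: CHAPTER followed by digits rather than letters.
text = "see CHAPTER 12"

loose_hit = re.search(loose, text) is not None
grouped_hit = re.search(grouped, text) is not None
print(loose_hit, grouped_hit)
```

The ungrouped pattern accepts this line because the bare word CHAPTER is enough to satisfy it; the grouped pattern rejects it. On the novel's headings both forms are likely to select the same lines, which is why the difference is easy to miss.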
Read the file ‘mbox-short.txt’.
Find and count all lines containing the string ‘.edu’ followed by one or more blank spaces. Note: to use the period as a regular character instead of a wildcard, precede it with a backslash.
import re
hand = open('mbox-short.txt',"r")
lines = 0
for line in hand:
    # The r prefix makes this a raw string, so Python passes the backslash to re unchanged
    if re.search(r"\.edu +", line):
        lines += 1
print(lines)
## 315
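To see why the backslash matters, here is a small sketch, assuming two made-up strings, that compares the escaped and unescaped period. It also uses re.escape, which builds the escaped form for you:

```python
import re

# Hypothetical strings: only the first contains a literal ".edu".
samples = ["stephen@uct.edu ", "an educated guess "]

# Escaped period: matches only a real "." before "edu"
escaped = [s for s in samples if re.search(r"\.edu", s)]

# Unescaped period: "." is a wildcard, so " edu" in "an educated" also matches
unescaped = [s for s in samples if re.search(r".edu", s)]

# re.escape adds the backslash automatically
auto = re.escape(".edu")

print(escaped)
print(unescaped)
print(auto)
```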
Read the file ‘Crime and Punishment.txt’ and look for a lowercase letter followed by a question mark and then a blank space. Count the lines that match this pattern. Note: to use the question mark as a regular character, precede it with a backslash.
import re
hand = open('Crime and Punishment.txt',"r")
lines = 0
for line in hand:
    # Raw string: the backslash reaches re unchanged and makes "?" a literal character
    if re.search(r"[a-z]\? ", line):
        lines += 1
print(lines)
## 951
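Without the backslash, the question mark is a quantifier meaning “zero or one of the preceding item.” A minimal sketch, assuming a short made-up string, contrasts the two readings:

```python
import re

text = "Why? who knows"   # hypothetical sample string

# Escaped: a lowercase letter, a literal "?", then a space
literal = re.findall(r"[a-z]\? ", text)

# Unescaped: an optional lowercase letter, then a space
quantifier = re.findall(r"[a-z]? ", text)

print(literal)      # ['y? ']
print(quantifier)   # [' ', 'o ']
```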
Read the files ‘Crime and Punishment.txt’, ‘mbox-short.txt’, and ‘Pride and Prejudice.txt’. Look for one or more digits surrounded by blank spaces. Count the lines that match this pattern. Also count the total number of times the pattern occurs. The code below processes ‘mbox-short.txt’; change the file name to process the other two files.
import re
hand = open('mbox-short.txt',"r")
lines = 0
patterns = 0
for line in hand:
    matches = re.findall(" [0-9]+ ", line)
    if len(matches) > 0:
        lines += 1
        patterns += len(matches)
print(lines)
## 363
print(patterns)
## 641
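Since the same count is wanted for several files, the loop can be wrapped in a function. The sketch below is one possible arrangement, not the only one; the name count_pattern and the sample lines are made up for illustration, and the function accepts any iterable of lines, including an open file.

```python
import re

def count_pattern(lines, pattern=r" [0-9]+ "):
    """Return (matching_lines, total_matches) for an iterable of text lines."""
    matching_lines = 0
    total_matches = 0
    for line in lines:
        matches = re.findall(pattern, line)
        if matches:
            matching_lines += 1
            total_matches += len(matches)
    return matching_lines, total_matches

# Hypothetical lines standing in for a file.
sample = ["on 5 Jan 2008 at 10 am\n", "no digits here\n", "port 25 open\n"]
result = count_pattern(sample)
print(result)

# With a real file, the call would look like this:
# with open('mbox-short.txt', "r") as hand:
#     print(count_pattern(hand))
```

Note that findall returns non-overlapping matches. In a string such as " 1 2 " only the first number is counted, because the blank space after it is consumed by the first match.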
Regular expressions use a backslash followed by a letter to denote several useful sets of characters.
\s means any white-space character.
\S means any non-white-space character.
\d means any digit.
\D means any non-digit.
\w means any word character (a letter, digit, or underscore).
\W means any non-word character.
https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference is an only slightly overwhelming reference.
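The character sets listed above can be tried out on a short string. This is a minimal sketch, assuming a made-up five-character sample, that applies each of the six sets in turn:

```python
import re

text = "a1 _!"   # hypothetical sample: letter, digit, space, underscore, punctuation

classes = [r"\s", r"\S", r"\d", r"\D", r"\w", r"\W"]

# Map each character set to the list of characters it matches in text
results = {c: re.findall(c, text) for c in classes}

for c, found in results.items():
    print(c, found)
```

Each uppercase set matches exactly the characters that its lowercase partner does not.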
Instead of just [a-z] or [A-Z] we can list specific sets such as [aeiou] or [AEIOU].
Example
import re
x = re.findall("^[AEIOU].+", "cat")
print(x)
## []
x = re.findall("^[AEIOU].+", "B cat")
print(x)
## []
x = re.findall("^[AEIOU].+", "A cat")
print(x)
## ['A cat']
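A specific set can also be used on its own, without the ^ anchor, to pick out individual characters. As a small sketch on a made-up phrase, this collects every lowercase vowel:

```python
import re

phrase = "regular expression"   # hypothetical sample text

# [aeiou] matches any single character from the listed set
vowels = re.findall("[aeiou]", phrase)

print(vowels)
print(len(vowels))
```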
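import re
heading = '<h1>TITLE</h1>'
# Greedy: .* matches as much as possible, so it runs to the last '>'
resg = re.findall('<.*>',heading)
print(resg)
## ['<h1>TITLE</h1>']
# Non-greedy: .*? matches as little as possible, so it stops at the first '>'
resng = re.findall('<.*?>',heading)
print(resng)
## ['<h1>', '</h1>']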
Phone numbers can be written in different patterns.
10 digits: 3609569999
Area code in parentheses, then a dash: (360)956-9999
Two dashes: 360-956-9999
Write a pattern for each of these and test them.
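import re
# Raw strings (r"...") pass the backslashes to re unchanged
ph1 = r"\d{10}"
# Exactly 10 digits
ph2 = r"\(\d{3}\)\d{3}-\d{4}"
# Treat parentheses as regular characters
ph3 = r"\d{3}-\d{3}-\d{4}"
# Three digits, a dash, three digits, a dash, four digits
phnos = "xxx 5152349999 (515)234-9999 515-234-9999"
res = re.findall(ph1+"|"+ph2+"|"+ph3,phnos)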
print(res)
## ['5152349999', '(515)234-9999', '515-234-9999']
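To test each pattern on its own, one option is re.fullmatch, which succeeds only when the whole string fits the pattern. The sketch below uses the three example numbers from above plus one made-up malformed number; the dictionary labels are illustrative only.

```python
import re

patterns = {
    "ten digits": r"\d{10}",
    "parenthesized area code": r"\(\d{3}\)\d{3}-\d{4}",
    "two dashes": r"\d{3}-\d{3}-\d{4}",
}

# The last candidate is a hypothetical malformed number that should match nothing.
candidates = ["3609569999", "(360)956-9999", "360-956-9999", "360-9569999"]

results = {}
for candidate in candidates:
    # Record the label of every pattern that matches the entire candidate
    results[candidate] = [name for name, pat in patterns.items()
                          if re.fullmatch(pat, candidate)]
    print(candidate, results[candidate])
```

Each well-formed number should be claimed by exactly one pattern, and the malformed one by none.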