CUNY SPS DATA 607 Assignment 3

Assignment
Problem 3
Problem 4
- Part (a)
- Part (b)
- Part (c)
- Part (d)
- Part (e)
Problem 9

Assignment

Complete problems 3 and 4 from chapter 8 of Automated Data Collection with R by Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. Problem 9 is extra credit.

Problem 3

Copy the introductory example. The vector name stores the extracted names.

name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
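The chunks in this report assume that the stringr package is loaded and that name already exists. A minimal setup sketch follows; as a shortcut for illustration, the vector is typed in from the printed output above rather than re-extracted from the book's raw data.

```r
# Assumes the stringr package is installed.
library(stringr)

# Typed in from the printed output above (an illustrative shortcut);
# the book builds this vector by extracting names from a raw text string.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
name
```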

Part (a)

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

str_replace(name, "([^,]+),\\s(.+)", "\\2 \\1")

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
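To see what the two capture groups pick up, the pieces can be inspected with str_match before they are swapped. This is only a sketch: it writes the "anything but a comma" class as `[^,]`, and the variable name parts is made up for the illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Column 1 is the full match, column 2 the text before the comma (last name),
# column 3 the text after the comma and space (first names).
parts <- str_match(name, "([^,]+),\\s(.+)")
parts[2, ]

# Names without a comma do not match at all (a row of NA),
# which is why str_replace() leaves them unchanged.
is.na(parts[1, 1])
```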

Part (b)

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

has_title <- str_detect(name, "[[:alpha:]]{2,}\\.")
has_title

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
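As a quick check on what the pattern actually matched, str_extract can pull out the matched text instead of returning TRUE/FALSE. In this sketch the variable name titles and the extra test string with "Jr." are illustrative assumptions, not part of the assignment.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Two or more letters followed by a period; NA where nothing matches.
# The single-letter initial "C." is skipped because of the {2,} quantifier.
titles <- str_extract(name, "[[:alpha:]]{2,}\\.")
titles

# Caveat: any abbreviation of two or more letters is flagged, titles or not.
str_detect("Homer Simpson Jr.", "[[:alpha:]]{2,}\\.")
```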

Part (c)

Construct a logical vector indicating whether a character has a second name.

First add some additional second names for testing purposes

# Add some extra names for testing...
name <- c("Moe Ed Szyslak", "Burns, C. Montgomery", "Rev. Timothy Andrew Lovejoy", "Ned T. Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
name

## [1] "Moe Ed Szyslak"              "Burns, C. Montgomery"       
## [3] "Rev. Timothy Andrew Lovejoy" "Ned T. Flanders"            
## [5] "Simpson, Homer"              "Dr. Julius Hibbert"

Logical vectors

# This works on the original list, but only when the first or middle name is an initial, not when there are two full names
str_detect(name, " \\w{1}\\.")

## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

# This works to capture both initials and full second names
str_detect(name, " \\w{1}\\.|(\\w+\\s){2}\\w+")

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Rearrange the names to first last order

first_last <- str_replace(name, "([^,]+),\\s(.+)", "\\2 \\1")
first_last

## [1] "Moe Ed Szyslak"              "C. Montgomery Burns"        
## [3] "Rev. Timothy Andrew Lovejoy" "Ned T. Flanders"            
## [5] "Homer Simpson"               "Dr. Julius Hibbert"

Logical vector

# This works on the rearranged list
str_detect(first_last, "\\b\\w{1}\\. |(\\w+\\s){2}\\w+")

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
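An alternative way to reach the same answer is to strip any title and count the remaining name parts. This is a sketch of a different technique, not the approach used above; the names no_title, n_parts and has_second_name are made up for the illustration, and it assumes titles always come first and end in a period.

```r
library(stringr)

name <- c("Moe Ed Szyslak", "Burns, C. Montgomery",
          "Rev. Timothy Andrew Lovejoy", "Ned T. Flanders",
          "Simpson, Homer", "Dr. Julius Hibbert")

# Put every name in first-last order.
first_last <- str_replace(name, "([^,]+),\\s(.+)", "\\2 \\1")

# Drop a leading title such as "Rev. " or "Dr. " (assumed to end in a period).
no_title <- str_replace(first_last, "^[[:alpha:]]{2,}\\.\\s", "")

# Count whitespace-separated parts; more than two means a second name.
n_parts <- str_count(no_title, "\\S+")
has_second_name <- n_parts > 2
has_second_name
```

On the test vector this agrees with the regular-expression version: TRUE for the first four names and FALSE for the last two.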

Problem 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

I used the str_view_all function to shade only the matches in the examples below and to show examples of similar non-matching strings to highlight the differences between them.

Part (a)

[0-9]+\\$ would match one or more digits (and digits only) followed by a dollar sign.

x <- c("$4343", "23543$", "23,543$")
str_view_all(x, "[0-9]+\\$")
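Because str_view_all only shades the matches in a viewer, a plain-text check can help. The sketch below uses str_extract on the same test strings; the variable name matched is illustrative.

```r
library(stringr)

x <- c("$4343", "23543$", "23,543$")

# Returns the matched text, or NA when there is no match.
# "$4343" fails because the dollar sign comes before the digits;
# in "23,543$" only the digits after the comma are part of the match.
matched <- str_extract(x, "[0-9]+\\$")
matched
```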

Part (b)

\\b[a-z]{1,4}\\b would match whole words of one to four lowercase letters.

x <- c("Betsy", "a", "bets", "bananas")
str_view_all(x, "\\b[a-z]{1,4}\\b")
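The word boundaries are what keep the pattern from matching part of a longer word. A small sketch, with an extra sentence added here purely for illustration, shows this in plain text:

```r
library(stringr)

x <- c("Betsy", "a", "bets", "bananas")

# "Betsy" fails: "etsy" is lowercase, but there is no boundary between B and e.
# "bananas" fails: no run of 1 to 4 letters has a boundary on both sides.
short_word <- str_detect(x, "\\b[a-z]{1,4}\\b")
short_word

# In a sentence (illustrative input), only the short words are extracted.
short_in_sentence <- str_extract_all("the quick brown fox", "\\b[a-z]{1,4}\\b")[[1]]
short_in_sentence
```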

Part (c)

.*?\\.txt$ would match zero or more characters of any type followed by ‘.txt’ at the end of the string.

x <- c("testfile.txt", "testfile.csv", ".txt", "file.txttt")
str_view_all(x, ".*?\\.txt$")
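A plain-text check of the same examples is sketched below. It also compares the lazy `.*?` with a greedy `.*`: because the pattern is anchored at the end of the string, both return the same matches here. The variable names are illustrative.

```r
library(stringr)

x <- c("testfile.txt", "testfile.csv", ".txt", "file.txttt")

# ".txt" alone matches because .*? may match zero characters.
is_txt <- str_detect(x, ".*?\\.txt$")
is_txt

# Lazy and greedy versions extract the same text on these inputs.
lazy   <- str_extract(x, ".*?\\.txt$")
greedy <- str_extract(x, ".*\\.txt$")
identical(lazy, greedy)
```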

Part (d)

\\d{2}/\\d{2}/\\d{4} would match a date in MM/DD/YYYY or DD/MM/YYYY format. It checks only the digit counts and the slashes, not whether the month and day are valid.

x <- c("04/19/1972", "25/07/1993", "4/19/1972")
str_view_all(x, "\\d{2}/\\d{2}/\\d{4}")
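The pattern only checks the shape of the string. The sketch below adds one impossible date, "99/99/0000" (made up for illustration), to show that it still matches:

```r
library(stringr)

x <- c("04/19/1972", "25/07/1993", "4/19/1972", "99/99/0000")

# "4/19/1972" fails because the first field has only one digit.
# "99/99/0000" matches even though it is not a real date.
looks_like_date <- str_detect(x, "\\d{2}/\\d{2}/\\d{4}")
looks_like_date
```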

Part (e)

<(.+?)>.+?</\\1> would match an opening HTML tag and its closing tag with at least one character of text in between. Because the second .+? is lazy, the match runs from the opening tag to the first occurrence of the matching closing tag. The backreference \\1 must repeat everything captured between < and >, so the pattern will not match tags that carry attributes in the opening tag, such as links.

x <- c("<strong>Hello World!</strong>", 
       '<a href="https://www.google.com">Google</a>', 
       "<strong>Hello</strong> <em>World</em>!", 
       "<em>overlapping<strong> tags</em>? Nested tag gets ignored.</strong>")
# Can't figure out how to get it to show the HTML tags in the output.  Sorry!
str_view_all(x, "<(.+?)>.+?</\\1>")
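One way to see the tags themselves is to print the matches as ordinary strings rather than viewing them. The sketch below uses str_extract and str_extract_all on the same test vector; the variable names first_match and both_tags are illustrative.

```r
library(stringr)

x <- c("<strong>Hello World!</strong>",
       '<a href="https://www.google.com">Google</a>',
       "<strong>Hello</strong> <em>World</em>!",
       "<em>overlapping<strong> tags</em>? Nested tag gets ignored.</strong>")

pattern <- "<(.+?)>.+?</\\1>"

# First match per string, printed as plain text; NA for the link,
# whose closing tag does not repeat the href attribute.
first_match <- str_extract(x, pattern)
first_match

# The third string contains two separate tag pairs.
both_tags <- str_extract_all(x[3], pattern)[[1]]
both_tags
```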

Problem 9

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others!

x <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
message <- unlist(str_extract_all(x, "[[:upper:][:punct:] ]"))
noquote(paste(message, collapse = ""))

## [1] CONGRATULATIONS.YOU.ARE.A.SUPERNERD!
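The periods in the decoded message stand in for spaces. As an optional last step, they can be swapped out; this sketch starts from the decoded string shown above, and the variable names decoded and readable are illustrative.

```r
library(stringr)

# The decoded message as printed above.
decoded <- "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"

# Replace every literal period with a space.
readable <- str_replace_all(decoded, "\\.", " ")
readable
```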

Betsy Rosalen

February 17, 2018
