IS607 Assignment 4

library(stringr)

Question 4

Part A

For this question we were given the regular expression [0-9]+\$. Breaking this down, [0-9] matches a single digit, and the + means one or more of those digits. The \$ is an escaped dollar sign, so the match must end in a literal $ rather than at the end of the string. This was tested using the following example:

example.obj <- "1234 12345$12345679 9812$$58325$ 94332"
unlist(str_extract_all(example.obj, "[0-9]+\\$"))

## [1] "12345$" "9812$"  "58325$"

As you can see, it only extracted runs of digits that ended with a $. The number of digits preceding the $ varies because of the +.
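The backslash matters here: without it, $ acts as an anchor for the end of the string. Below is a minimal sketch of the difference, using a short made-up string (`demo` is not part of the assignment):

```r
library(stringr)

# Hypothetical test string, chosen so the two patterns disagree.
demo <- "12$ 34"

# Escaped: \\$ is a literal dollar sign.
literal <- unlist(str_extract_all(demo, "[0-9]+\\$"))   # "12$"

# Unescaped: $ anchors the match to the end of the string.
anchored <- unlist(str_extract_all(demo, "[0-9]+$"))    # "34"
```

The doubled backslash is needed because R first reads "\\$" as the string \$, which is what the regex engine then sees.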

Part B

For this part we were given the regular expression \b[a-z]{1,4}\b. Breaking it down, we see that the whole expression is enclosed in \b, which marks the edge of a word, so the expression matches only whole words. The [a-z] matches lowercase letters, and the {1,4} limits the word to 1 to 4 letters in length. This was tested using the following code:

example.obj <- "
HAD I the heavens' embroidered cloths,
Enwrought with golden and silver light,
The blue and the dim and the dark cloths
Of night and light and the half light,
I would spread the cloths under your feet:
But I, being poor, have only my dreams;
I have spread my dreams under your feet;
Tread softly because you tread on my dreams. a " 

unlist(str_extract_all(example.obj, "\\b[a-z]{1,4}\\b"))

##  [1] "the"  "with" "and"  "blue" "and"  "the"  "dim"  "and"  "the"  "dark"
## [11] "and"  "and"  "the"  "half" "the"  "your" "feet" "poor" "have" "only"
## [21] "my"   "have" "my"   "your" "feet" "you"  "on"   "my"   "a"

This Yeats poem, one of my favorites, provided the perfect example: as you can see, only the lowercase words that were 1 to 4 letters long were selected. I added a simple “a” at the end of the quote to show that single-letter words are matched, but only if they are lowercase (the capital “I” is skipped).
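To see what the \b markers contribute, it may help to run the same character class without them. This is a small sketch on a made-up two-word string (`demo` is an assumption, not from the assignment):

```r
library(stringr)

# Hypothetical input: one short word and one long word.
demo <- "the heavens"

# With word boundaries, only whole words of 1 to 4 letters match.
with_boundaries <- unlist(str_extract_all(demo, "\\b[a-z]{1,4}\\b"))  # "the"

# Without them, long words are chopped into pieces of up to 4 letters.
without_boundaries <- unlist(str_extract_all(demo, "[a-z]{1,4}"))     # "the" "heav" "ens"
```

Without the boundaries, "heavens" is split into "heav" and "ens", which is why the poem output above contains no fragments of longer words.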

Part C

Breaking down the next expression, .*?\.txt$, we see the following. The . matches any character and the * allows any number of them, including none. The question mark makes the * lazy, so it takes as few characters as it can. The \.txt$ means that the string must end in a literal .txt. This would be most useful when searching a database for text file names.

example.obj <- c( "apple.txt", "1.text" , "123.text" , "superman vs. Doomsday.txt", ".txt" )

unlist(str_extract_all(example.obj, ".*?\\.txt$"))

## [1] "apple.txt"                 "superman vs. Doomsday.txt"
## [3] ".txt"
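Note that ".txt" on its own matches because * allows zero characters. The effect of the lazy ? is easier to see without the $ anchor; the following is a rough sketch on a made-up string (`demo` is an assumption for illustration):

```r
library(stringr)

# Hypothetical string containing two ".txt" occurrences.
demo <- "a.txt b.txt"

# Lazy: stops at the first ".txt" it can reach.
lazy <- str_extract(demo, ".*?\\.txt")             # "a.txt"

# Greedy: runs on to the last ".txt".
greedy <- str_extract(demo, ".*\\.txt")            # "a.txt b.txt"

# With the $ anchor the match has to reach the end of the string,
# so the lazy version returns the whole string as well.
lazy_anchored <- str_extract(demo, ".*?\\.txt$")   # "a.txt b.txt"
```

So in the anchored expression from the assignment, the lazy quantifier does not change which strings match.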

Part D

The breakdown of \d{2}/\d{2}/\d{4} is 2 digits followed by a forward slash, then 2 more digits, another forward slash, and finally 4 more digits. This is likely to be a date in the form 01/01/2000.

example.obj <- c( "01-01-2001", "12/25/2015" , "09/20/2015" , "01/01/2010", "1/9/1991" )

unlist(str_extract_all(example.obj, "\\d{2}/\\d{2}/\\d{4}"))

## [1] "12/25/2015" "09/20/2015" "01/01/2010"

As you can see, dates not in that form are not selected (i.e., those with only one digit for the day or month, or with - as the separator).
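The expression only checks the shape of the text, not whether it is a real date. As a sketch, assuming the dates are in month/day/year order (the assignment does not say), base R's as.Date can be used as a second check:

```r
library(stringr)

# Hypothetical candidates: one real date and one impossible one.
candidates <- c("12/25/2015", "99/99/9999")

# Both have the right shape, so the regex accepts both.
shaped <- str_detect(candidates, "^\\d{2}/\\d{2}/\\d{4}$")   # TRUE TRUE

# Assumed format: month/day/year. Impossible dates come back as NA.
parsed <- as.Date(candidates, format = "%m/%d/%Y")
is_real_date <- !is.na(parsed)                               # TRUE FALSE
```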

Part E

The breakdown of <(.+?)>.+?</\1> was much more difficult, so we took it in bits.

example.obj <- c( "<>", "<143 33>", "<teststest>" )

unlist(str_extract_all(example.obj, "<(.+?)>"))

## [1] "<143 33>"    "<teststest>"

So the first part of the expression, <(.+?)>, requires there to be some text in between < and >; the .+? inside the parentheses means at least one character, matched lazily. The second .+? matches the text that follows the tag. The \1 is a backreference to the first group, so </\1> requires a closing tag that repeats whatever (.+?) captured. The full expression therefore most likely matches an HTML element with its opening and closing tags. The following example shows what is extracted from the data:

example.obj <- c( " <TITLE> Hello World! </TITLE>", " <HEAD> <TITLE> A Small Hello" )

unlist(str_extract_all(example.obj, "<(.+?)>.+?</\\1>"))

## [1] "<TITLE> Hello World! </TITLE>"
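To look at what the group actually captured, str_match can be used in place of str_extract_all. This is a minimal sketch; the mismatched-tag string is made up for illustration:

```r
library(stringr)

pattern <- "<(.+?)>.+?</\\1>"

# str_match returns the full match in column 1 and each group after it.
m <- str_match(" <TITLE> Hello World! </TITLE>", pattern)
full_match <- m[1, 1]   # "<TITLE> Hello World! </TITLE>"
tag_name   <- m[1, 2]   # "TITLE"

# Hypothetical input whose closing tag does not repeat the opening tag.
mismatched <- str_detect("<b>bold</i>", pattern)   # FALSE
```

The second result shows the backreference at work: the closing tag has to repeat the captured name, so <b> paired with </i> is rejected.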

Question 5

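First, in the expression from Question 4 Part A, we could rewrite the [0-9] as [:digit:]. We could then write {1,} to replace the +. (stringr accepts [:digit:] on its own, as the output below shows; base R regular expressions need it inside brackets, as [[:digit:]].) This would read: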

example.obj <- "1234 12345$12345679 9812$$58325$ 94332"
unlist(str_extract_all(example.obj, "[:digit:]{1,}\\$"))

## [1] "12345$" "9812$"  "58325$"
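As a quick check that the rewritten forms really are equivalent on this input, the sketch below runs three spellings and compares the results. The \d variant is our own addition and was not asked for in the question:

```r
library(stringr)

example.obj <- "1234 12345$12345679 9812$$58325$ 94332"

# Three ways to say "one or more digits followed by a literal $".
patterns <- c("[0-9]+\\$", "[[:digit:]]{1,}\\$", "\\d+\\$")

results <- lapply(patterns, function(p) unlist(str_extract_all(example.obj, p)))

# TRUE if every pattern returned the same matches as the first one.
all_same <- all(vapply(results, identical, logical(1), results[[1]]))
```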

Question 6

Part A

Using the str_replace function this could be done simply as follows:

email <- "chunkylover53[at]aol[dot]com"

email <- str_replace(email, pattern = fixed("[at]"), replacement = "@")
email <- str_replace(email, pattern = fixed("[dot]"), replacement = ".")

email

## [1] "chunkylover53@aol.com"
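One caveat: str_replace only changes the first occurrence, which is enough here because the address has a single [dot]. For an address with more than one, str_replace_all would be needed. A small sketch with a made-up address (`obscured` is hypothetical):

```r
library(stringr)

# Hypothetical address with two [dot] markers.
obscured <- "first[dot]last[at]example[dot]com"

# str_replace stops after the first match.
first_only <- str_replace(obscured, fixed("[dot]"), ".")
# "first.last[at]example[dot]com"

# str_replace_all handles every occurrence.
cleaned <- str_replace_all(obscured, fixed("[dot]"), ".")
cleaned <- str_replace_all(cleaned, fixed("[at]"), "@")
# "first.last@example.com"
```

Wrapping the pattern in fixed() keeps the square brackets from being read as a character class.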

Part B

To test why this fails we run the code for extraction as is:

unlist(str_extract_all(email, "[:digit:]"))

## [1] "5" "3"

As you can see, it did extract the digits, but it extracted them separately. As we want them together, it would be best to add a + or {1,} quantifier.

unlist(str_extract_all(email, "[:digit:]{1,}"))

## [1] "53"

This extracts both digits together.
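The quantifier groups each run of neighboring digits; it does not merge digits that are separated by other characters. A short sketch on a made-up string (`demo` is an assumption):

```r
library(stringr)

# Hypothetical string with two separate runs of digits.
demo <- "abc12def345"

# No quantifier: one match per digit.
singles <- unlist(str_extract_all(demo, "[[:digit:]]"))    # "1" "2" "3" "4" "5"

# With +: one match per run of digits.
runs <- unlist(str_extract_all(demo, "[[:digit:]]+"))      # "12" "345"
```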

Part C

Again, the best way to see why this fails is to test it:

unlist(str_extract_all(email, "\\D"))

##  [1] "c" "h" "u" "n" "k" "y" "l" "o" "v" "e" "r" "@" "a" "o" "l" "." "c"
## [18] "o" "m"

This code does what it is programmed to do and extracts all string elements that aren’t digits. We essentially did the opposite of what we wanted. The \d would be more useful; however, again we have to add a quantifier.

unlist(str_extract_all(email, "\\d+"))

## [1] "53"

The following code would be used if we wanted to eliminate the digits altogether:

unlist(str_extract_all(email, "\\D+"))

## [1] "chunkylover" "@aol.com"
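The extraction above returns the non-digit text in two pieces. If a single string without the digits is wanted, one possible approach (our own variation, not part of the question) is to replace the digits with nothing:

```r
library(stringr)

email <- "chunkylover53@aol.com"

# Replace every run of digits with an empty string.
no_digits <- str_replace_all(email, "\\d+", "")   # "chunkylover@aol.com"

# Equivalent result by gluing the extracted pieces back together.
pieces   <- unlist(str_extract_all(email, "\\D+"))
rejoined <- str_c(pieces, collapse = "")          # "chunkylover@aol.com"
```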

Matthew Farris

September 20, 2015
