Assignment_3

Problem 3a: Rearrange the name vector to be first name last name

Let's first understand the raw.data extraction logic.

The logic below searches for runs of two or more characters, where each character is a letter, a comma, a period, or a space. The {2,} quantifier keeps the single spaces inside the phone numbers from being returned as matches. str_extract_all returns every match instead of stopping at the first one.

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

library(stringr)
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

What class is the ‘name’ vector?

class(name)

## [1] "character"

We see that it is a character vector.

Problem 3a asks us to rearrange the vector so that every entry is in first name last name format.

As a first attempt, let's extract every run of one or more letters (upper or lower case) from each name.

str_extract_all(name, "[A-Za-z]+")

## [[1]]
## [1] "Moe"     "Szyslak"
## 
## [[2]]
## [1] "Burns"      "C"          "Montgomery"
## 
## [[3]]
## [1] "Rev"     "Timothy" "Lovejoy"
## 
## [[4]]
## [1] "Ned"      "Flanders"
## 
## [[5]]
## [1] "Simpson" "Homer"  
## 
## [[6]]
## [1] "Dr"      "Julius"  "Hibbert"

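Notice that this did not solve the problem: it only splits each name into its parts and keeps the titles. Let's try a different way.

Replace titles and middle initials (letters followed by a period) with an empty string.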

Fname_LName<-str_replace(name,'[A-Za-z]+[.]','')

# Clean up the leftover comma and extra spaces
Fname_LName<-str_replace(Fname_LName,'[,]|, |^ ', '')
Fname_LName<-str_replace(Fname_LName,'  ', ' ')


Fname_LName

## [1] "Moe Szyslak"      "Burns Montgomery" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Simpson Homer"    "Julius Hibbert"
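
The result above still leaves "Burns Montgomery" and "Simpson Homer" in last name, first name order. A minimal sketch of one way to swap them is shown here. It assumes that a title is two or more letters followed by a period, that a comma always separates last name from first name, and that the middle initial should be kept; the variables no_title and first_last are names introduced only for this illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop titles: two or more letters followed by a period (so "C." is kept)
no_title <- str_trim(str_replace(name, "\\b[A-Za-z]{2,}\\.\\s*", ""))

# Swap "Last, First" into "First Last"; names without a comma are unchanged
first_last <- str_replace(no_title, "^(\\w+),\\s*(.+)$", "\\2 \\1")
first_last
```

Under these assumptions the second and fifth entries become "C. Montgomery Burns" and "Homer Simpson". To drop the initial as well, the title pattern could be loosened back to one or more letters.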

Problem 3b: Construct a logical vector indicating whether a character has a title

name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

str_detect(name, '^[A-Za-z]+[.]')

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
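
One caveat worth sketching: the pattern above flags any leading run of letters followed by a period, so a name that starts with an initial would also count as a title. The test_names vector below is hypothetical, and requiring at least two letters is only one possible assumption about what a title looks like.

```r
library(stringr)

# Hypothetical names; the last one starts with an initial, not a title
test_names <- c("Rev. Timothy Lovejoy", "Dr. Julius Hibbert",
                "Ned Flanders", "C. Montgomery Burns")

# Original pattern: one or more letters followed by a period
loose <- str_detect(test_names, "^[A-Za-z]+[.]")

# Stricter variant: at least two letters before the period
strict <- str_detect(test_names, "^[A-Za-z]{2,}[.]")

data.frame(test_names, loose, strict)
```

Only the last row differs: the loose pattern reports TRUE for "C. Montgomery Burns", the strict one reports FALSE.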

Problem 3c Construct a logical vector indicating whether a character has a second name.

name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

str_detect(name,' [A-Z][.]')

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

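Problem 4a: Describe the regular expression and construct an example that matches [0-9]+\$

This matches one or more digits immediately followed by a literal $ (the backslash escapes the $, which would otherwise anchor the end of the string). It is useful when looking for dollar amounts written with a trailing dollar sign.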

#Example

str_extract_all('Product 1 sales 5000$   Product 2 Sales 400$','[0-9]+\\$')

## [[1]]
## [1] "5000$" "400$"
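
As a small follow-up sketch, the extracted amounts can be turned into numbers by stripping the dollar sign. The variable names sales, amounts and values are introduced here only for illustration.

```r
library(stringr)

sales <- 'Product 1 sales 5000$   Product 2 Sales 400$'

# Extract each run of digits that ends with a literal $
amounts <- unlist(str_extract_all(sales, '[0-9]+\\$'))

# Remove the literal $ and convert the remaining digits to numbers
values <- as.numeric(str_replace(amounts, fixed('$'), ''))
sum(values)
```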

Problem 4b

\b[a-z]{1,4}\b

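The expression matches whole words made of one to four lowercase letters a-z. The \b on each side is a word boundary, so the letters must form a complete word. Notice that it does not extract “hard” even though it is four lowercase letters in sequence, because “hard” is followed by more letters in “hardwork” and so has no word boundary after it. “I” is skipped because it is upper case.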

str_extract_all('I love hardwork.','\\b[a-z]{1,4}\\b')

## [[1]]
## [1] "love"
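
A slightly longer sentence makes the one-to-four letter limit easier to see. This is only an illustrative sketch; the sentence and the variable short_words are made up for the example.

```r
library(stringr)

sentence <- 'the quick brown fox jumps over a lazy dog'

# Whole lowercase words of 1 to 4 letters; five-letter words are skipped
short_words <- unlist(str_extract_all(sentence, '\\b[a-z]{1,4}\\b'))
short_words
```

The five-letter words "quick", "brown" and "jumps" are left out, while "a" and "over" sit at the two ends of the allowed range.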

Problem 4c .*?\.txt$

The expression matches any characters, zero or more times and as few as possible, followed by a literal .txt at the very end of the input string. Note that it will return a string containing spaces and punctuation as long as the last characters are .txt. Because of the $ anchor, it returns nothing when .txt appears earlier in the input but is not the last thing in the string.

str_extract_all('File_001.txt','.*?\\.txt$')

## [[1]]
## [1] "File_001.txt"

#Here we want to return only File_001.txt, but this pattern does not work because the string does not end in .txt
str_extract_all('File_001.txt, File_003.csv','.*?\\.txt$')

## [[1]]
## character(0)

#To address this issue we replace $ with the word boundary \\b (the dot before txt stays escaped so it matches a literal period)
str_extract_all('File_001.txt, File_003.csv','.*?\\.txt\\b')

## [[1]]
## [1] "File_001.txt"
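
The $ anchor is arguably most useful when each file name is its own string. The sketch below assumes a hypothetical vector of file names and filters it with the anchored pattern.

```r
library(stringr)

# Hypothetical file names, one per element
files <- c("File_001.txt", "File_003.csv", "notes.txt.bak", "README.txt")

# TRUE only when the name ends in a literal .txt
is_txt <- str_detect(files, "\\.txt$")
files[is_txt]
```

"notes.txt.bak" is rejected because .txt is not at the end of that string.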

Problem 4d \d{2}/\d{2}/\d{4}

This pattern looks, anywhere in the string, for two consecutive digits, a ‘/’, another two consecutive digits, another ‘/’, and then four consecutive digits.

This is like a date of birth format, e.g. 04/12/1987. The pattern only checks the shape of the text, not whether the digits form a real date.

str_extract('John smith Date of birth is 22/12/1990','\\d{2}/\\d{2}/\\d{4}')

## [1] "22/12/1990"
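
Since the pattern only checks the shape, it also accepts impossible dates and can match inside longer runs of digits. A rough sketch of one way to check the result is to parse it with as.Date, assuming day/month/year order as in the example above. The candidates vector is made up for illustration.

```r
library(stringr)

# Hypothetical inputs: a real date, an impossible date, and a longer digit run
candidates <- c("22/12/1990", "99/99/9999", "Invoice 123/45/67890")

# All three contain text with the dd/dd/dddd shape
found <- str_extract(candidates, "\\d{2}/\\d{2}/\\d{4}")

# as.Date returns NA when the digits do not form a valid day/month/year
parsed <- as.Date(found, format = "%d/%m/%Y")
is_real_date <- !is.na(parsed)
data.frame(candidates, found, is_real_date)
```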

Problem 4e

<(.+?)>.+?</\1>

This backreference pattern looks for a string that starts with <, followed by one or more of any character (captured as group 1, as few characters as possible), followed by >. It then picks up the characters after that opening tag until it sees the same captured text again, this time preceded by ‘</’ and followed by ‘>’, that is, the matching closing tag.

str_extract('1. A <xml>mall gfg <gh> </xml></</<\\1entence. - 2. Another tiny sentence.','<(.+?)>.+?</\\1>')

## [1] "<xml>mall gfg <gh> </xml>"

#This example is a catalog of books in XML, with a few attributes for each book. The regex finds an opening tag, captures its name, then extracts everything up to the matching closing tag
catalog_xmlTag <- "<catalog>
                      <book id=1>
                        <author>Gambardella, Matthew</author>
                        <title>XML Developer's Guide</title>
                        <genre>Computer</genre>
                      </book>
                      <book id=2>
                        <author>Ralls, Kim</author>
                        <title>Midnight Rain</title>
                        <genre>Fantasy</genre>
                      </book>
                   </catalog>"

str_extract(catalog_xmlTag,'<(.+?)>.+?</\\1>')

## [1] "<author>Gambardella, Matthew</author>"

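#Notice that it did not match the catalog or book tags: by default . does not match newline characters, and the text between those tags spans several lines
#(the book tags would also fail because the opening tag carries an id attribute that the closing tag lacks)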
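
If matching across lines is wanted, stringr's regex() helper has a dotall option that lets . match newlines. The sketch below uses a small made-up snippet (xml_snippet) rather than the full catalog, and assumes tags without attributes.

```r
library(stringr)

# Hypothetical two-level snippet with a newline inside the outer tag
xml_snippet <- "<book>\n  <title>Midnight Rain</title>\n</book>"

# Default: . stops at newlines, so only the single-line inner tag matches
single_line <- str_extract(xml_snippet, "<(.+?)>.+?</\\1>")

# dotall = TRUE: . also matches newlines, so the outer tag matches
multi_line <- str_extract(xml_snippet, regex("<(.+?)>.+?</\\1>", dotall = TRUE))

single_line
multi_line
```

For real XML a parser is usually the safer choice; the regex is shown here only to illustrate how the backreference behaves.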

Problem 9 attempt

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

code <-'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'


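# Extract all numbers (kept for inspection; not needed for the answer)
code_nums <- str_extract_all(code, '[0-9]+')

# Use another variable to store only the non-digit characters
codealphabet <- str_replace_all(code, '[0-9]+', '')

# Removing the lowercase letters leaves the uppercase letters and punctuation
uppercase_only <- str_replace_all(codealphabet, '[a-z]', '')
# Removing the uppercase letters leaves the lowercase letters and punctuation
lowercase_only <- str_replace_all(codealphabet, '[A-Z]', '')

# The answer is the uppercase string
uppercase_only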

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"
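
The same message can also be pulled out in a single extraction step. This is just an alternative sketch: it keeps uppercase letters, periods and exclamation marks, and then treats the periods as word separators, which is an assumption about how the puzzle was built. The names hidden and message_text are introduced for the example.

```r
library(stringr)

code <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'

# Keep only uppercase letters, periods and exclamation marks, then join them
hidden <- paste(unlist(str_extract_all(code, "[[:upper:].!]")), collapse = "")

# Assumption: the periods separate words, so swap them for spaces
message_text <- str_replace_all(hidden, fixed("."), " ")
message_text
```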

Assignment_3_Data607

Nnaemezue Obi-Eyisi

February 18, 2017