GitHub link: GitHub link for Assignment 3. RPubs link: RPubs link.
In this assignment we explore regular expressions to extract useful information from raw text, and we use other string functions to manipulate the extracted data.
# Load stringr, which provides the str_* functions used throughout
library(stringr)
# First we create a variable that stores the raw data
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"
# Extract the names: runs of two or more letters, periods, commas, or spaces
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
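The {2,} quantifier in the pattern above matters. The following is a minimal sketch, assuming stringr is installed, that compares it with {1,} on the same raw string. The variable names loose and strict are illustrative only and do not appear in the original assignment.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"

# With {1,}, the lone spaces inside "(636) 555-0113" and "555 8904" also match
loose <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{1,}"))

# With {2,}, only runs of two or more allowed characters survive
strict <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

length(loose)   # 8 matches, two of them a single space
length(strict)  # 6 matches, one per person
```

Requiring at least two characters is a cheap way to discard the stray spaces that sit between groups of digits in the phone numbers.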
# In this step we split each name, using a comma as the delimiter.
names <- str_split(name, ",")
names
## [[1]]
## [1] "Moe Szyslak"
##
## [[2]]
## [1] "Burns" " C. Montgomery"
##
## [[3]]
## [1] "Rev. Timothy Lovejoy"
##
## [[4]]
## [1] "Ned Flanders"
##
## [[5]]
## [1] "Simpson" " Homer"
##
## [[6]]
## [1] "Dr. Julius Hibbert"
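As a possible alternative to the list returned by str_split, str_split_fixed returns a character matrix with a fixed number of columns, which can be easier to index. This is only a sketch; the variable name parts is an assumption introduced for illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Split each name into at most two pieces at the first comma.
# Names without a comma get an empty string in the second column.
parts <- str_split_fixed(name, ",", 2)

# Trim the leading space left behind after the comma
parts[, 2] <- str_trim(parts[, 2])
parts
```

A non-empty second column then marks exactly the names that were stored in "Last, First" order.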
getFirstNamesLastNames <- function() {
  # Vector that will hold each name as "FirstName LastName"
  FirstNamesLastNames <- c()
  # Loop over the split names (the list "names" created above) to find names in "Last, First" order
  for (i in seq_along(names)) {
    len <- length(names[[i]])
    # More than one piece means the name contained a comma and needs to be rearranged
    if (len > 1) {
      # Reverse the pieces so that the given name comes first
      fullName <- rev(names[[i]])
      # Keep only the first word of the given name (so " C. Montgomery" becomes "C"),
      # then append the last name
      FirstNamesLastNames[i] <- str_trim(str_c(str_extract(fullName[1], "[[:alpha:]]+"), " ", fullName[2]))
    } else {
      # These names are already in the correct order; strip the title ("Rev.", "Dr.")
      FirstNamesLastNames[i] <- str_trim(str_replace(names[[i]], pattern = "[[:alpha:]]+\\.", replacement = ""))
    }
  }
  # At this point the names are in the correct format, so we return them
  return(FirstNamesLastNames)
}
listOfFirstNameLastNames <- getFirstNamesLastNames()
listOfFirstNameLastNames
## [1] "Moe Szyslak" "C Burns" "Timothy Lovejoy" "Ned Flanders"
## [5] "Homer Simpson" "Julius Hibbert"
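The same rearrangement can also be sketched without a loop, using a back-reference in str_replace. This is an alternative to the function above, not the method it uses, and the variable names reordered and cleaned are assumptions made for illustration. Unlike the function above, this version keeps the full given name "C. Montgomery" instead of shortening it to "C".

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Swap "Last, First" into "First Last"; names without a comma are left unchanged
reordered <- str_replace(name, "^([^,]+),\\s*(.+)$", "\\2 \\1")

# Drop a leading title: two or more letters followed by a period
cleaned <- str_replace(reordered, "^[[:alpha:]]{2,}\\.\\s*", "")
cleaned
```

Because str_replace is vectorised, both steps apply to the whole name vector at once.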
# Logical vector: TRUE where a name has a title (two or more letters followed by a period).
IsTitleVector=str_detect(name, "[:alpha:]{2,}\\.")
IsTitleVector
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
name[IsTitleVector]
## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"
# Logical vector: TRUE where a name has an initial (a space, one letter, then a period).
IsInitialVector=str_detect(name, "[:space:][:alpha:]\\.")
IsInitialVector
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
name[IsInitialVector]
## [1] "Burns, C. Montgomery"
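The two logical vectors can be placed next to the names for easier inspection. The following is a small sketch; the data frame flags and its column names are hypothetical and not part of the original assignment.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# One row per person, with a flag for each pattern
flags <- data.frame(
  name = name,
  has_title = str_detect(name, "[[:alpha:]]{2,}\\."),
  has_initial = str_detect(name, "[[:space:]][[:alpha:]]\\."),
  stringsAsFactors = FALSE
)
flags
```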
The following regular expression matches one or more digits followed by a literal "$" sign.
#Will Conform
sample <- "12345$"
regex = "[0-9]+\\$"
str_extract(sample, regex)
## [1] "12345$"
#Will not conform
sample <- "34567a$"
str_extract(sample, regex)
## [1] NA
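The backslashes before the dollar sign are what make it literal. A short sketch of the difference, assuming stringr is loaded; the variable names and the sample strings are made up for illustration.

```r
library(stringr)

# Escaped: \\$ is a literal dollar sign
escaped <- str_extract("12345$", "[0-9]+\\$")              # "12345$"

# Unescaped: $ anchors the match at the end of the string
unescaped <- str_extract("12345$", "[0-9]+$")              # NA, the string ends in "$", not a digit
anchored <- str_extract("12345", "[0-9]+$")                # "12345"

# The escaped pattern can also match inside a longer string
embedded <- str_extract("cost: 20$ each", "[0-9]+\\$")     # "20$"
```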
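This regex matches a whole word made up of one to four lowercase letters. In the second example the first word, "abCD", contains uppercase letters, so the first match is "efgh".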
#Will Conform
sample <- "abcd efgh"
regex = "\\b[a-z]{1,4}\\b"
str_extract(sample, regex)
## [1] "abcd"
sample <- "abCD efgh"
str_extract(sample, regex)
## [1] "efgh"
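To see the effect of the word boundaries on every word at once, str_extract_all can be used instead of str_extract. A minimal sketch; the sample text and the variable names text and words are assumptions.

```r
library(stringr)

regex <- "\\b[a-z]{1,4}\\b"
text <- "abcd efgh ijklm no abCD"

# "ijklm" is too long and "abCD" contains uppercase letters, so neither matches
words <- str_extract_all(text, regex)[[1]]
words  # "abcd" "efgh" "no"
```

Without the \\b anchors the pattern would also pick up fragments such as "ijkl" from inside longer words.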
This regex matches any string that ends in ".txt".
#Will Conform
sample <- "Address.txt"
regex = ".*?\\.txt$"
str_extract(sample, regex)
## [1] "Address.txt"
#Will Not Conform
sample <- "address.txtabcdef"
str_extract(sample, regex)
## [1] NA
#Will Not Conform
sample <- "address.csv"
str_extract(sample, regex)
## [1] NA
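A pattern like this is typically used to filter file names. The sketch below uses str_detect with just the anchored suffix; the vector files and its last entry are made up for illustration.

```r
library(stringr)

files <- c("Address.txt", "address.txtabcdef", "address.csv", "notes.txt")

# TRUE only where the name ends in a literal ".txt"
is_txt <- str_detect(files, "\\.txt$")

files[is_txt]  # "Address.txt" "notes.txt"
```

The leading .*? is only needed when the whole name should be extracted; for a yes/no test the suffix alone is enough.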
This regex matches two digits, a slash, two more digits, another slash, and then four digits. It resembles a date in MM/DD/YYYY or DD/MM/YYYY format, but any digits conform, so it does not validate that the date is a real one.
#Will Conform
sample <- "22/10/2014"
regex = "\\d{2}/\\d{2}/\\d{4}"
str_extract(sample, regex)
## [1] "22/10/2014"
#Will Conform
sample <- "60/60/2014"
str_extract(sample, regex)
## [1] "60/60/2014"
#Will not Conform
sample <- "ad/23/2014"
str_extract(sample, regex)
## [1] NA
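If some validation is wanted, the digit ranges can be narrowed. The following sketch assumes the DD/MM/YYYY reading and restricts days to 01 through 31 and months to 01 through 12. The pattern strict and the sample vector dates are assumptions, and the pattern still accepts impossible dates such as 31/02/2014.

```r
library(stringr)

# Day 01-31, month 01-12, four-digit year, anchored to the whole string
strict <- "^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/[0-9]{4}$"

dates <- c("22/10/2014", "60/60/2014", "ad/23/2014", "10/22/2014")

valid <- str_detect(dates, strict)
valid  # TRUE FALSE FALSE FALSE
```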
This regex extracts a pair of begin and end tags, as in an HTML or XML document, and uses a back-reference to make sure that the begin and end tags match. Notice that in the first example it picks up the second tag pair rather than the first, because the first begin tag has no matching end tag. In the third example there is no match because the begin and end tags differ.
#Conforms
regex = "<(.+?)>.+?</\\1>"
sample = "<Title>Sometext</head><body>Sometext</body>"
str_extract(sample, regex)
## [1] "<body>Sometext</body>"
#Conforms
sample = "<html>Sometext</html>"
str_extract(sample, regex)
## [1] "<html>Sometext</html>"
#Does not Conform
sample = "<html>Sometext</htm>"
str_extract(sample, regex)
## [1] NA
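To inspect what the back-reference actually captured, str_match returns the full match together with each capture group. A brief sketch; the variable names m and no_match are illustrative.

```r
library(stringr)

regex <- "<(.+?)>.+?</\\1>"

# Column 1 is the full match, column 2 is the first capture group
m <- str_match("<Title>Sometext</head><body>Sometext</body>", regex)
m[1, 1]  # "<body>Sometext</body>"
m[1, 2]  # "body"

# When the tags differ there is no match and every column is NA
no_match <- str_match("<html>Sometext</htm>", regex)
```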
encryptedCode = "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
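# Keep only the runs of uppercase letters and periods hidden in the string
regex <- "[[:upper:].]+"
pieces <- unlist(str_extract_all(encryptedCode, regex))
# The cracked code is below: join the pieces and turn each period into a space
str_replace_all(paste(pieces, collapse = ""), pattern = "[\\.]+", replacement = " ")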
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"
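The same idea can be sketched the other way round, by deleting everything that is not an uppercase letter or a period. The short string toy below is a made-up example, not the assignment's encrypted code, and the variable names are assumptions.

```r
library(stringr)

# Hypothetical scrambled string hiding "HI.THERE"
toy <- "a1Hb2Ic3.d4Te5He6Ef7Rg8Eh"

# Remove every character that is not an uppercase letter or a period
kept <- str_replace_all(toy, "[^[:upper:].]", "")

# Turn the periods into spaces
message_text <- str_replace_all(kept, fixed("."), " ")
message_text  # "HI THERE"
```

A negated character class avoids the extract, unlist, and paste steps, at the cost of being slightly harder to read.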