607 Week 4 Assignment

Chirag Vithalani

February 16, 2016


    1. Copy the introductory example. The vector name stores the extracted names.
      R> name
      [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
      [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
      
      1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

      The introductory example gives us the following starting point:

      library(stringr)
      raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
      
      name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
      name
      ## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
      ## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
      • To get the names into the standard format, follow these steps:

        1. Convert the character vector to a data frame and rename its column.
          # convert the vector to a one-column data frame
          namesdf <- do.call(rbind, lapply(name, data.frame, stringsAsFactors = FALSE))
          # copy the auto-generated column (X..i..) to a readable name
          namesdf$names <- namesdf$X..i..
        2. If a name contains a comma, the last name comes first, so rearrange it. Taking only the last and the first word drops the middle initial "C.".
          namesdf$stdFormatNames <- ifelse(grepl(",", namesdf$names), paste(word(namesdf$names, -1), word(namesdf$names, 1)), namesdf$names)
        3. Remove titles and commas.
          # escape the dots so they match a literal "." rather than any character
          namesdf$stdFormatNames <- gsub("Rev\\.|Dr\\.|,", "", namesdf$stdFormatNames)
          namesdf$stdFormatNames
          ## [1] "Moe Szyslak"      "Montgomery Burns" " Timothy Lovejoy"
          ## [4] "Ned Flanders"     "Homer Simpson"    " Julius Hibbert"
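
The rearrangement can also be sketched with backreferences instead of word(). This is only an illustrative alternative, not the approach used above: the variable names no_title and std_name are made up for the example, and unlike the steps above it keeps the middle initial "C." and leaves no leading space.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop a leading title together with the space after it.
# The dot is escaped so that it matches a literal "." only.
no_title <- str_replace(name, "^(Rev|Dr)\\.\\s+", "")

# Swap "Last, First ..." into "First ... Last":
# group 1 is everything before the comma, group 2 everything after it.
std_name <- str_replace(no_title, "^([^,]+),\\s*(.+)$", "\\2 \\1")
std_name
## [1] "Moe Szyslak"         "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders"        "Homer Simpson"       "Julius Hibbert"
```

Names without a comma do not match the second pattern, so str_replace returns them unchanged.
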
      1. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
    • Use str_detect to check whether a name contains a title.

      namesdf$names
      ## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
      ## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
      # escape the dots so that only a literal "Rev." or "Dr." counts as a title
      namesdf$hasTitle <- str_detect(namesdf$names, "Rev\\.|Dr\\.")
      namesdf[,c("names","hasTitle")]
      ##                  names hasTitle
      ## 1          Moe Szyslak    FALSE
      ## 2 Burns, C. Montgomery    FALSE
      ## 3 Rev. Timothy Lovejoy     TRUE
      ## 4         Ned Flanders    FALSE
      ## 5       Simpson, Homer    FALSE
      ## 6   Dr. Julius Hibbert     TRUE
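
In a regular expression an unescaped . matches any character, so a pattern such as "Rev.|Dr." also accepts strings that merely begin with "Dr" plus one more letter. The sketch below illustrates the difference; "Drake Example" is a made-up name that is not part of the assignment data, and loose and strict are hypothetical variable names.

```r
library(stringr)

# "Drake Example" is a hypothetical name used only to show a false positive.
candidates <- c("Dr. Julius Hibbert", "Rev. Timothy Lovejoy", "Drake Example")

# Unescaped dots: "Dr." matches "Dra" in "Drake".
loose <- str_detect(candidates, "Rev.|Dr.")

# Escaped dots, anchored at the start of the string.
strict <- str_detect(candidates, "^(Rev|Dr)\\.")

loose
## [1] TRUE TRUE TRUE
strict
## [1]  TRUE  TRUE FALSE
```

On the six names in this assignment both patterns give the same answer; the stricter form only matters for other input.
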
      1. Construct a logical vector indicating whether a character has a second name.
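    • Check whether the trimmed standardized name contains a space. Taking "second name" to mean a last name, this is TRUE for every character.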

      namesdf$stdFormatNames
      ## [1] "Moe Szyslak"      "Montgomery Burns" " Timothy Lovejoy"
      ## [4] "Ned Flanders"     "Homer Simpson"    " Julius Hibbert"
      grepl( " ",str_trim(namesdf$stdFormatNames))
      ## [1] TRUE TRUE TRUE TRUE TRUE TRUE
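
If "second name" is read as a middle name rather than a surname, one possible check is to remove the title and count the remaining name parts. This is a sketch under that assumed reading; no_title and has_second_name are names introduced for the example.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Remove a leading title so that it is not counted as a name part.
no_title <- str_replace(name, "^(Rev|Dr)\\.\\s+", "")

# Count runs of letters; more than two parts means a second (middle) name.
has_second_name <- str_count(no_title, "[[:alpha:]]+") > 2
has_second_name
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
```

Only "Burns, C. Montgomery" has three name parts, so it is the only TRUE.
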
    1. Consider the string <title>+++BREAKING NEWS+++</title>. We would like to extract the first HTML tag. To do so we write the regular expression <.+>. Explain why this fails and correct the expression.
    • R applies greedy quantification by default: a quantifier such as + matches as many characters as it can. Because . matches any character, including >, the pattern <.+> runs from the first < to the last > in the string, so it returns the whole string instead of just the first tag.

      stringToSearch<-"<title>+++BREAKING NEWS+++</title>"
      str_extract(stringToSearch,"<.+>")
      ## [1] "<title>+++BREAKING NEWS+++</title>"

      We can change this behavior by adding a ? after the quantifier, which makes it lazy: .+? matches the shortest possible sequence of characters, so the match stops at the first >.

      str_extract(stringToSearch, "<.+?>")
      ## [1] "<title>"
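
A lazy quantifier is not the only way to stop at the first >. As a sketch of an alternative, a negated character class [^>] forbids the match from crossing a closing bracket at all. The variable names html, greedy, lazy, negated and all_tags are made up for this example.

```r
library(stringr)

html <- "<title>+++BREAKING NEWS+++</title>"

greedy  <- str_extract(html, "<.+>")     # runs to the last ">"
lazy    <- str_extract(html, "<.+?>")    # stops at the first ">"
negated <- str_extract(html, "<[^>]+>")  # cannot contain ">" inside the tag

# The negated class also works for pulling out every tag.
all_tags <- str_extract_all(html, "<[^>]+>")[[1]]
all_tags
## [1] "<title>"  "</title>"
```

Here lazy and negated both return "<title>", while greedy returns the whole string.
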
    1. Consider the string (5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem. We would like to extract the formula in the string. To do so we write the regular expression [^0-9=+*()]+. Explain why this fails and correct the expression.
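    • A caret at the beginning of a character class negates it, so [^0-9=+*()]+ matches a run of characters that are NOT digits or the listed symbols; the first such run is the single -. Moving the caret away from the first position makes it a literal ^, and adding - at the end of the class (where it is also literal) lets the class match the whole formula.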

      binomial_str <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem."
      str_extract(binomial_str, "[^0-9=+*()]+")
      ## [1] "-"
      str_extract(binomial_str, "[0-9=+*()^-]+")
      ## [1] "(5-3)^2=5^2-2*5*3+3^2"
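
Relying on the position of ^ and - inside the brackets is easy to get wrong. As a sketch of an equivalent form, both characters can be escaped, which makes them literal wherever they appear in the class. The variable names negated_match and escaped_match are hypothetical.

```r
library(stringr)

binomial_str <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem."

# Leading caret negates the class: the first run of "other" characters is "-".
negated_match <- str_extract(binomial_str, "[^0-9=+*()]+")

# Escaped caret and hyphen are literal regardless of position.
escaped_match <- str_extract(binomial_str, "[\\^\\-0-9=+*()]+")
escaped_match
## [1] "(5-3)^2=5^2-2*5*3+3^2"
```

The match stops at the space after the formula because a space is not in the class.
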