Assignment on RPubs
Rmd on Github

  1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

     library(stringr)
    
     majorCSV <- read.csv("https://raw.githubusercontent.com/logicalschema/DATA607/master/week3/majors-list.csv")
    
    
     # Search the Major column with the regular expression DATA|STATISTICS and return the matching values.
    grep('DATA|STATISTICS', majorCSV$Major,  value = TRUE)
    ## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
    ## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
    ## [3] "STATISTICS AND DECISION SCIENCE"
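
If the matching rows are wanted rather than just the matching strings, a logical index from `grepl()` is one option. The sketch below uses a small stand-in data frame (the `majors` object and its three rows are made up for illustration, not read from the real CSV) so that it runs without a network connection.

```r
# Hypothetical stand-in for the majors data; only the Major column is modelled.
majors <- data.frame(
  Major = c("STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE per row instead of the matched values.
hits <- grepl("DATA|STATISTICS", majors$Major)

# Keep whole rows; drop = FALSE preserves the data frame structure.
matched <- majors[hits, , drop = FALSE]
print(matched)
```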


  2. Write code that transforms the data below:

    [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
    
    [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
    
    [9] "elderberry"   "lime"         "lychee"       "mulberry"    
    
    [13] "olive"        "salal berry"

    Into a format like this:

    c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")


    w <- c("bell pepper","bilberry","blackberry","blood orange")
    x <- c("blueberry","cantaloupe","chili pepper","cloudberry") 
    y <- c("elderberry","lime","lychee","mulberry") 
    z <- c("olive","salal berry")
    
    
    combined <- c(w, x, y, z)
    
    print(combined)
    ##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
    ##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
    ## [11] "lychee"       "mulberry"     "olive"        "salal berry"
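
The answer above retypes the values by hand. A possible alternative, assuming the starting point is the printed console text itself, is to pull the quoted items out with a regular expression and rebuild the `c(...)` call as text. The variable names `raw`, `fruits`, and `as_code` are illustrative.

```r
# The printed output, stored as a single string (assumed input).
raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every double-quoted token, then strip the quote characters.
quoted <- regmatches(raw, gregexpr('"[^"]*"', raw))[[1]]
fruits <- gsub('"', '', quoted, fixed = TRUE)

# Rebuild the c(...) expression as text.
as_code <- paste0('c(', paste0('"', fruits, '"', collapse = ', '), ')')
cat(as_code, "\n")
```

The index markers such as `[1]` and `[5]` are ignored because they are not wrapped in double quotes.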


The two exercises below are taken from R for Data Science, section 14.3.5.1 of the online version:

  1. Describe, in words, what these expressions will match:

    (.)\1\1
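    Read as a regular expression, this matches any character (except a line break) repeated three times in a row, such as “aaa”. Typed as an R string without escaping the backslashes, however, each “\1” becomes the control character \001, so the pattern matches any character (except a line break) followed by two \001 characters. Examples would be “b\1\1”, “c\1\1”, or “5\1\1”.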

    x <- c("b\1\1", "c\1\1", "hello\1\1", "yellow")
    str_match(x, '(.)\1\1')  
    ##      [,1]        [,2]
    ## [1,] "b\001\001" "b" 
    ## [2,] "c\001\001" "c" 
    ## [3,] "o\001\001" "o" 
    ## [4,] NA          NA
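
To make the effect of escaping concrete, the sketch below runs the same pattern with and without doubled backslashes. It uses base R `grepl()` with `perl = TRUE` instead of `str_match()` so that it has no package dependency; the test vector is made up for illustration.

```r
x <- c("aaa", "b\1\1", "hello", "yellow")

unescaped <- "(.)\1\1"    # R converts each \1 to the control character \001
escaped   <- "(.)\\1\\1"  # the regex engine receives the backreference \1

hits_unescaped <- grepl(unescaped, x, perl = TRUE)  # TRUE only for "b\1\1"
hits_escaped   <- grepl(escaped, x, perl = TRUE)    # TRUE only for "aaa"

print(hits_unescaped)
print(hits_escaped)
```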

    “(.)(.)\2\1”
    This expression matches any two characters (excluding line breaks) followed by the same two characters in reverse order. Examples would be “abba”, “0110”, or “daad”.

    x <- c("abba", "0110", "ACTGGTCA", "yellow")
    str_match(x, "(.)(.)\\2\\1")  
    ##      [,1]   [,2] [,3]
    ## [1,] "abba" "a"  "b" 
    ## [2,] "0110" "0"  "1" 
    ## [3,] "TGGT" "T"  "G" 
    ## [4,] NA     NA   NA

    (..)\1
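    Read as a regular expression, this matches any two characters (excluding line breaks) that are immediately repeated, such as “abab”. Typed as an R string without escaping the backslash, “\1” becomes the control character \001, so the pattern matches any two characters followed by \001. Examples would be “ab\1”, “54\1”, or “11\1”.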

    x <- c("ab\1", "red", "A\1", "AABBCC\1")
    str_match(x, '(..)\1')  
    ##      [,1]     [,2]
    ## [1,] "ab\001" "ab"
    ## [2,] NA       NA  
    ## [3,] NA       NA  
    ## [4,] "CC\001" "CC"

    “(.).\1.\1”
    This expression matches a character that recurs two and four positions after its first occurrence, with any single character (excluding line breaks) in each gap. Examples would be “a0a1a”, “c1d1e1”, and “-1-2-3”.

    x <- c("a0a1a", "blue", "c1d1e1", "-1-2-3")
    str_match(x, "(.).\\1.\\1")
    ##      [,1]    [,2]
    ## [1,] "a0a1a" "a" 
    ## [2,] NA      NA  
    ## [3,] "1d1e1" "1" 
    ## [4,] "-1-2-" "-"

    “(.)(.)(.).*\3\2\1”
    This expression matches three characters (excluding line breaks), then any number of characters, then the same three characters in reverse order. Examples would be “abcjfkdjkfjicba” and “0110Middleofthestring110”.

    x <- c("abcjfkdjkfjicba", "redyellow001middle100kdlskdls", "beginbegin1middlegebend", "98&^A")
    str_match(x, "(.)(.)(.).*\\3\\2\\1")  #Note the \ needs to be escaped in the expression.
    ##      [,1]                   [,2] [,3] [,4]
    ## [1,] "abcjfkdjkfjicba"      "a"  "b"  "c" 
    ## [2,] "001middle100"         "0"  "0"  "1" 
    ## [3,] "beginbegin1middlegeb" "b"  "e"  "g" 
    ## [4,] NA                     NA   NA   NA


  2. Construct regular expressions to match words that:

    Start and end with the same character:
    "^(.)(.*)\1$"

    x <- c("amiddleofthestringa", "0red,yellow,green0", "red", "9*&^(^Hjshjshf9")
    str_match(x, "^(.)(.*)\\1$")  #Note the \ needs to be escaped in the expression.
    ##      [,1]                  [,2] [,3]               
    ## [1,] "amiddleofthestringa" "a"  "middleofthestring"
    ## [2,] "0red,yellow,green0"  "0"  "red,yellow,green" 
    ## [3,] NA                    NA   NA                 
    ## [4,] "9*&^(^Hjshjshf9"     "9"  "*&^(^Hjshjshf"
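
The pattern above needs at least two characters, so a one-letter word such as “a” is not matched. Whether such a word should count is a matter of interpretation. If it should, one possible variant makes everything after the first character optional, as sketched below with base R `grepl()`; the names `strict` and `lenient` are illustrative.

```r
x <- c("a", "aba", "ab", "level")

strict  <- "^(.)(.*)\\1$"    # needs at least two characters
lenient <- "^(.)(.*\\1)?$"   # also accepts a single character

strict_hits  <- grepl(strict, x, perl = TRUE)
lenient_hits <- grepl(lenient, x, perl = TRUE)

print(strict_hits)   # "a" is rejected
print(lenient_hits)  # "a" is accepted
```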

    Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.):
    "(..)(.*)\1"

    x <- c("church", "blue", "red", "abracadabra")
    str_match(x, "(..)(.*)\\1")  #Note the \ needs to be escaped in the expression.
    ##      [,1]        [,2] [,3]   
    ## [1,] "church"    "ch" "ur"   
    ## [2,] NA          NA   NA     
    ## [3,] NA          NA   NA     
    ## [4,] "abracadab" "ab" "racad"
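
Because `.` matches any character, the pattern above also accepts repeated pairs of digits, spaces, or punctuation. If only letters should count, a character class is one way to tighten it. The sketch below assumes plain ASCII letters and uses base R `grepl()`; the test strings are made up.

```r
x <- c("church", "1212", "blue", "abracadabra")

any_pair    <- "(..).*\\1"             # any two characters, repeated later
letter_pair <- "([A-Za-z]{2}).*\\1"    # two ASCII letters, repeated later

any_hits    <- grepl(any_pair, x, perl = TRUE)
letter_hits <- grepl(letter_pair, x, perl = TRUE)

print(any_hits)     # "1212" is accepted
print(letter_hits)  # "1212" is rejected
```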

    Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s):
    "(.)(.*)\1(.*)\1"

    x <- c("eleven", "blue", "010001001010", "yellow submarine light")
    str_match(x, "(.)(.*)\\1(.*)\\1")  #Note the \ needs to be escaped in the expression.
    ##      [,1]               [,2] [,3]       [,4]           
    ## [1,] "eleve"            "e"  "l"        "v"            
    ## [2,] NA                 NA   NA         NA             
    ## [3,] "010001001010"     "0"  "10001001" "1"            
    ## [4,] "llow submarine l" "l"  ""         "ow submarine "
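
As a sanity check on the backreference pattern, the same question can be answered without a regular expression by counting characters. The sketch below is a cross-check, not part of the exercise answer. It drops the extra capture groups, which are not needed for detection, and compares the result with a character count on a made-up vector.

```r
x <- c("eleven", "blue", "banana", "abab")

# Regex approach: some character occurs, then occurs again twice more.
by_regex <- grepl("(.).*\\1.*\\1", x, perl = TRUE)

# Counting approach: split into characters and look at the largest count.
by_count <- vapply(
  strsplit(x, ""),
  function(chars) max(table(chars)) >= 3,
  logical(1)
)

print(by_regex)
print(identical(by_regex, by_count))
```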