Exercise 11.6.1

For each of the following regular expressions, explain in words what it matches. Then add test strings to demonstrate that it does in fact match the pattern you claim it does. Make sure that your set of test strings has several examples that match as well as several that do not.

(a) This regular expression (‘a’) matches: the character ‘a’ and returns TRUE if the string contains that character at least one time.

##      string result
## 1       cat   TRUE
## 2       dog  FALSE
## 3 flagstaff   TRUE
## 4      mood  FALSE
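
The tables in this exercise can be reproduced with a few lines of R. The following is a minimal sketch that uses base R's grepl() in place of stringr's str_detect() so that it runs without extra packages; the name tab_a is only illustrative.

```r
# Test strings taken from the table above.
strings <- c("cat", "dog", "flagstaff", "mood")

# grepl() returns TRUE when the pattern occurs anywhere in the string.
tab_a <- data.frame(string = strings, result = grepl("a", strings))
print(tab_a)
```

Swapping in a different pattern and test vector gives the tables for the remaining parts.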

(b) This regular expression (‘ab’) matches: the consecutive substring ‘ab’ and returns TRUE if the string contains ‘ab’ anywhere.

##     string result
## 1      cab   TRUE
## 2    comfy  FALSE
## 3 abstract   TRUE
## 4      bad  FALSE
## 5    amber  FALSE

(c) This regular expression (‘[ab]’) matches: the characters ‘a’ or ‘b’ anywhere in the string.

##   string result
## 1    cab   TRUE
## 2  comfy  FALSE
## 3    bad   TRUE
## 4 camper   TRUE
## 5  amber   TRUE
## 6    bud   TRUE

(d) This regular expression (‘^[ab]’) matches: a string that starts with ‘a’ or ‘b’.

##   string result
## 1    cab  FALSE
## 2  comfy  FALSE
## 3    bad   TRUE
## 4 camper  FALSE
## 5    bud   TRUE
## 6   milk  FALSE
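
Parts (b) through (d) differ only in small ways, so it can help to run all three patterns against the same strings. This is a sketch in base R with illustrative column names; the strings are taken from the tables above.

```r
strings <- c("cab", "comfy", "bad", "camper", "amber", "bud")

cmp <- data.frame(
  string   = strings,
  ab       = grepl("ab", strings),    # the literal substring "ab"
  either   = grepl("[ab]", strings),  # an "a" or a "b" anywhere
  at_start = grepl("^[ab]", strings)  # an "a" or a "b" as the first character
)
print(cmp)
```

Only ‘cab’ contains the substring ‘ab’, every string except ‘comfy’ contains an ‘a’ or a ‘b’, and only ‘bad’, ‘amber’, and ‘bud’ start with one.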

(e) This regular expression (‘\d+\s[aA]’) matches: a substring with one or more digits, followed by exactly one whitespace character, followed by the character ‘a’ or ‘A’.

##         string result
## 1 Blue Flora15  FALSE
## 2       66 art   TRUE
## 3       99 Art   TRUE
## 4   22  apen15  FALSE
## 5        2 abc   TRUE
## 6  Albuquerque  FALSE
## 7       camper  FALSE
## 8         food  FALSE

(f) This regular expression (‘\d+\s*[aA]’) matches: one or more digits, followed by zero or more whitespace characters, followed by the character ‘a’ or ‘A’.

##        string result
## 1        11 a   TRUE
## 2      11abcd   TRUE
## 3        4Abc   TRUE
## 4  423    Abc   TRUE
## 5       pen15  FALSE
## 6 Albuquerque  FALSE
## 7  11 afjkdla   TRUE
## 8        food  FALSE
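
The only difference between (e) and (f) is the quantifier on the whitespace. A small side-by-side check, sketched in base R with perl = TRUE so that \d and \s are handled by the PCRE engine, makes the effect visible; the variable names are illustrative.

```r
strings <- c("66 art", "11abcd", "423    Abc", "pen15")

# \s requires exactly one whitespace character between the digits and a/A.
one_space  <- grepl("\\d+\\s[aA]",  strings, perl = TRUE)

# \s* allows none, one, or many whitespace characters.
any_spaces <- grepl("\\d+\\s*[aA]", strings, perl = TRUE)

print(data.frame(string = strings, one_space, any_spaces))
```

Strings with no space (‘11abcd’) or several spaces (‘423    Abc’) match only the second pattern.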

(g) This regular expression (‘.*’) matches: zero or more repetitions of any character, so every non-missing string matches, including the empty string.

##   string result
## 1   aaaa   TRUE
## 2    bad   TRUE
## 3          TRUE
## 4          TRUE
## 5   <NA>     NA
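
The NA in the last row comes from how missing values are handled. As a sketch, and assuming rows 3 and 4 of the table are an empty string and a single space (the rendered output does not show which), base R's grepl() reports FALSE for a missing value, so reproducing the NA requires an explicit check:

```r
# Rows 3 and 4 are assumed to be "" and " ".
strings <- c("aaaa", "bad", "", " ", NA)

# grepl() treats a missing value as "no match" and returns FALSE for it.
base_result <- grepl(".*", strings)

# Restore the missing value so the result lines up with the table above.
na_aware <- ifelse(is.na(strings), NA, base_result)

print(data.frame(string = strings, base_result, na_aware))
```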

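(h) This regular expression (‘^\w{2}bar’) matches: a string that starts with exactly two word characters (letters, digits, or underscore) immediately followed by ‘bar’. Anything may follow ‘bar’, since there is no end-of-string anchor.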

##          string result
## 1         aabar   TRUE
## 2 11barkerfluff   TRUE
## 3          abar  FALSE
## 4         1baar  FALSE
## 5         2abar   TRUE
## 6         $$bar  FALSE
## 7        $aabar  FALSE
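
To see that ‘bar’ does not need to be at the end of the string, the pattern can be compared with a variant that adds the end-of-string anchor. This is a sketch in base R; ‘__bar’ is an extra test string I added to show that \w also accepts underscores.

```r
strings <- c("aabar", "11barkerfluff", "__bar", "$aabar")

# Anchored at the start only: anything may follow "bar".
prefix_only  <- grepl("^\\w{2}bar",  strings, perl = TRUE)

# Anchored at both ends: the string must be exactly two word characters plus "bar".
whole_string <- grepl("^\\w{2}bar$", strings, perl = TRUE)

print(data.frame(string = strings, prefix_only, whole_string))
```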

(i) This regular expression (‘(foo\.bar)|(^\w{2}bar)’) matches: the literal substring ‘foo.bar’ anywhere in the string (the escaped period matches only an actual period), OR a string that starts with two word characters immediately followed by ‘bar’.

##    string result
## 1 foo.bar   TRUE
## 2 foo1bar  FALSE
## 3 foodbar  FALSE
## 4 44barry   TRUE
## 5  twenty  FALSE
## 6   abbar   TRUE
## 7   1abar   TRUE
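
Two details of this pattern are easy to miss: the ^ anchor belongs only to the second alternative, and the backslash is what stops the period from matching any character. A short base R sketch, with test strings of my own choosing, shows both:

```r
pattern <- "(foo\\.bar)|(^\\w{2}bar)"
strings <- c("xxfoo.bar", "xx44barry", "foo1bar", "44barry")

# "xxfoo.bar" matches through the first alternative, which is not anchored.
# "xx44barry" fails because the second alternative must start at position 1.
alt_result <- grepl(pattern, strings, perl = TRUE)
print(data.frame(string = strings, result = alt_result))

# Without the backslash, the period matches any character, so "foo1bar" matches.
unescaped <- grepl("foo.bar", "foo1bar", perl = TRUE)
print(unescaped)
```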

Exercise 11.6.2

The following file names were used in a camera trap study. The S number represents the site, P is the plot within a site, C is the camera number within the plot, the first string of numbers is the YearMonthDay and the second string of numbers is the HourMinuteSecond.

Produce a data frame with columns corresponding to the site, plot, camera, year, month, day, hour, minute, and second for these three file names.

##   site plot camera year month day hour minute second
## 1 S123   P2    C10 2012    06  21   21     34     22
## 2  S10   P1     C1 2012    06  22   05     01     48
## 3 S187   P2     C2 2012    07  02   02     35     01
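
One possible way to build this table is a single regular expression with nine capture groups. The sketch below is only an illustration: the file names are hypothetical, reconstructed from the output above, and the separators and the .jpg extension are assumptions rather than the exercise's actual inputs.

```r
# Hypothetical file names; the separators and extension are assumed.
file.names <- c("S123.P2.C10_20120621_213422.jpg",
                "S10.P1.C1_20120622_050148.jpg",
                "S187.P2.C2_20120702_023501.jpg")

# One capture group per output column.
pattern <- paste0("^(S\\d+)\\.(P\\d+)\\.(C\\d+)",
                  "_(\\d{4})(\\d{2})(\\d{2})",
                  "_(\\d{2})(\\d{2})(\\d{2})")

# regexec() locates the groups; regmatches() extracts them as character vectors.
m <- regmatches(file.names, regexec(pattern, file.names, perl = TRUE))

# Drop the full match (first element) and stack the groups into rows.
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
camera_df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(camera_df) <- c("site", "plot", "camera", "year", "month",
                      "day", "hour", "minute", "second")
print(camera_df)
```

Keeping the columns as character strings preserves leading zeros such as ‘06’ and ‘01’.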

Exercise 11.6.3

The full text from Lincoln’s Gettysburg Address is given below. Calculate the mean word length. Note: consider ‘battle-field’ as one word with 11 letters.

mean.wordlength
4.239852
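
The calculation can be sketched on a single sentence from the address. This is a base R illustration, not necessarily the code that produced the value above: it splits on whitespace, strips everything that is not a letter (so ‘battle-field’ counts as 11 letters), and drops any tokens left empty.

```r
# One sentence from the address, used here as a small sample.
sample_text <- "We are met on a great battle-field of that war."

# Split on runs of whitespace.
words <- strsplit(sample_text, "\\s+")[[1]]

# Remove punctuation and hyphens so only letters are counted.
letters_only <- gsub("[^A-Za-z]", "", words)

# Discard tokens that were pure punctuation (for example a stand-alone dash).
letters_only <- letters_only[nchar(letters_only) > 0]

sample_mean <- mean(nchar(letters_only))
print(sample_mean)  # 3.6 for this sentence (36 letters over 10 words)
```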

Exercise 11.6.4

Variable names in R may be any combination of letters, digits, period, and underscore. However, they may not start with a digit and if they start with a period, they must not be followed by a digit.

The first four are valid variable names, but the last four are not.

(a) First write a regular expression that determines if the string starts with a letter (upper or lower case) or underscore and then is followed by zero or more digits, letters, periods, or underscores. Notice I use the start/end of string markers. This is important so that we don’t just match somewhere in the middle of the variable name.

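library(tidyverse)   # provides %>%, mutate(), and str_detect()
# test strings as listed in the output below: the exercise's eight plus one extra ('abc_def')
strings <- c('foo15', 'Bar', '.resid', '_14s', '99_Bottles', '.9Arggh', 'Foo!', 'HIV Rate', 'abc_def')
data.frame( string=strings ) %>%
  mutate( result = str_detect(string, '^[a-zA-Z_](\\w|\\.|_)*$' )) 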
##       string result
## 1      foo15   TRUE
## 2        Bar   TRUE
## 3     .resid  FALSE
## 4       _14s   TRUE
## 5 99_Bottles  FALSE
## 6    .9Arggh  FALSE
## 7       Foo!  FALSE
## 8   HIV Rate  FALSE
## 9    abc_def   TRUE
# This accommodates any order of digits, letters, periods, and underscores after the first character.
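
To illustrate why the start and end markers matter, the same pattern can be run with and without them. This is a base R sketch on a subset of the test strings; without the anchors, the pattern only needs to find one acceptable run of characters somewhere inside the string.

```r
strings <- c("foo15", "99_Bottles", "Foo!", "HIV Rate")

# With ^ and $, the whole string must fit the pattern.
anchored   <- grepl("^[a-zA-Z_](\\w|\\.|_)*$", strings, perl = TRUE)

# Without them, a match in the middle (for example "_Bottles") is enough.
unanchored <- grepl("[a-zA-Z_](\\w|\\.|_)*", strings, perl = TRUE)

print(data.frame(string = strings, anchored, unanchored))
```

The unanchored version accepts all four strings, including the three invalid names.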

(b) Modify your regular expression so that the first group could be either [a-zA-Z_] as before or it could be a period followed by a letter or an underscore.

data.frame( string=strings ) %>%
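  # a leading period must be followed by a letter or underscore, so '.9Arggh' is rejected
  mutate( result = str_detect(string, '^([a-zA-Z_]|\\.[a-zA-Z_])(\\w|\\.|_)*$' )) 
##       string result
## 1      foo15   TRUE
## 2        Bar   TRUE
## 3     .resid   TRUE
## 4       _14s   TRUE
## 5 99_Bottles  FALSE
## 6    .9Arggh  FALSE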
## 7       Foo!  FALSE
## 8   HIV Rate  FALSE
## 9    abc_def   TRUE