Character classes and alternatives

suppressPackageStartupMessages(library("tidyverse"))

1. Create regular expressions to find all words that:

1. Words starting with vowels

str_subset(stringr::words, "^[aeiou]")

  [1] "a"           "able"        "about"       "absolute"    "accept"      "account"    
  [7] "achieve"     "across"      "act"         "active"      "actual"      "add"        
 [13] "address"     "admit"       "advertise"   "affect"      "afford"      "after"      
 [19] "afternoon"   "again"       "against"     "age"         "agent"       "ago"        
 [25] "agree"       "air"         "all"         "allow"       "almost"      "along"      
 [31] "already"     "alright"     "also"        "although"    "always"      "america"    
 [37] "amount"      "and"         "another"     "answer"      "any"         "apart"      
 [43] "apparent"    "appear"      "apply"       "appoint"     "approach"    "appropriate"
 [49] "area"        "argue"       "arm"         "around"      "arrange"     "art"        
 [55] "as"          "ask"         "associate"   "assume"      "at"          "attend"     
 [61] "authority"   "available"   "aware"       "away"        "awful"       "each"       
 [67] "early"       "east"        "easy"        "eat"         "economy"     "educate"    
 [73] "effect"      "egg"         "eight"       "either"      "elect"       "electric"   
 [79] "eleven"      "else"        "employ"      "encourage"   "end"         "engine"     
 [85] "english"     "enjoy"       "enough"      "enter"       "environment" "equal"      
 [91] "especial"    "europe"      "even"        "evening"     "ever"        "every"      
 [97] "evidence"    "exact"       "example"     "except"      "excuse"      "exercise"   
[103] "exist"       "expect"      "expense"     "experience"  "explain"     "express"    
[109] "extra"       "eye"         "idea"        "identify"    "if"          "imagine"    
[115] "important"   "improve"     "in"          "include"     "income"      "increase"   
[121] "indeed"      "individual"  "industry"    "inform"      "inside"      "instead"    
[127] "insure"      "interest"    "into"        "introduce"   "invest"      "involve"    
[133] "issue"       "it"          "item"        "obvious"     "occasion"    "odd"        
[139] "of"          "off"         "offer"       "office"      "often"       "okay"       
[145] "old"         "on"          "once"        "one"         "only"        "open"       
[151] "operate"     "opportunity" "oppose"      "or"          "order"       "organize"   
[157] "original"    "other"       "otherwise"   "ought"       "out"         "over"       
[163] "own"         "under"       "understand"  "union"       "unit"        "unite"      
[169] "university"  "unless"      "until"       "up"          "upon"        "use"        
[175] "usual"

2. Words that contain only consonants

str_subset(stringr::words, "^[^aeiou]+$")

[1] "by"  "dry" "fly" "mrs" "try" "why"

This pattern treats “y” as a consonant. It relies on the + quantifier introduced later; the alternative is a very verbose pattern that spells out each possible word length.
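
As a cross-check, the same words can be found without + by negating a search for any vowel. This is only a sketch: it assumes stringr::words contains no empty strings, and the names with_plus and without_plus are made up for illustration.

```r
library(stringr)

# Approach from the answer: every character is a non-vowel.
with_plus <- str_subset(stringr::words, "^[^aeiou]+$")

# Alternative: keep the words in which no vowel is detected anywhere.
without_plus <- stringr::words[!str_detect(stringr::words, "[aeiou]")]

# TRUE if both approaches select the same words in the same order.
identical(with_plus, without_plus)
```

The negated version avoids quantifiers entirely, at the cost of moving part of the logic out of the regular expression and into R.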

3. Words that end with “-ed” but not with “-eed”.

str_subset(stringr::words, "[^e]ed$")

[1] "bed"     "hundred" "red"

The pattern above will not match the word “ed”, because [^e] requires a character before “ed”. To include it, allow the start of the string as an alternative to that character.

str_subset(c("ed", stringr::words), "(^|[^e])ed$")

[1] "ed"      "bed"     "hundred" "red"
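
The same condition can also be written as two simpler tests combined in R, which may be easier to read than a single pattern. The sketch below uses a small made-up vector rather than stringr::words, and the variable names are illustrative only.

```r
library(stringr)

# Made-up sample: some words end in "ed", some in "eed", one only starts with "ed".
x <- c("ed", "bed", "feed", "need", "red", "hundred", "edit")

# Two-step version: ends in "ed" AND does not end in "eed".
two_step <- x[str_detect(x, "ed$") & !str_detect(x, "eed$")]

# One-step version from the answer.
one_step <- str_subset(x, "(^|[^e])ed$")

two_step
# [1] "ed"      "bed"     "red"     "hundred"
identical(two_step, one_step)
# [1] TRUE
```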

4. Words ending in “ing” or “ise”:

str_subset(stringr::words, "i(ng|se)$")

 [1] "advertise" "bring"     "during"    "evening"   "exercise"  "king"      "meaning"  
 [8] "morning"   "otherwise" "practise"  "raise"     "realise"   "ring"      "rise"     
[15] "sing"      "surprise"  "thing"

2. Empirically verify the rule “i before e except after c”.

The first pattern counts words that follow the rule (“cei”, or “ie” not preceded by “c”); the second counts words that break it (“cie”, or “ei” not preceded by “c”). Most matching words follow the rule, but not all.

length(str_subset(stringr::words, "(cei|[^c]ie)"))

[1] 14

length(str_subset(stringr::words, "(cie|[^c]ei)"))

[1] 3
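
To see how the two patterns classify individual words, it can help to run them side by side. The following is a minimal sketch on a hand-picked vector (not taken from stringr::words); the data frame check and its column names are made up for illustration.

```r
library(stringr)

# Hand-picked examples, chosen to cover both sides of the rule.
x <- c("receive", "believe", "science", "weigh", "field")

check <- data.frame(
  word    = x,
  follows = str_detect(x, "cei|[^c]ie"),  # consistent with the rule
  breaks  = str_detect(x, "cie|[^c]ei"),  # exceptions to the rule
  stringsAsFactors = FALSE
)
check
```

Here “receive”, “believe” and “field” are flagged as following the rule, while “science” and “weigh” are flagged as exceptions.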

3. Is “q” always followed by a “u”?

In the stringr::words dataset, yes.

str_view(stringr::words, "q[^u]", match = TRUE)

In the English language, no. However, the examples are few and mostly loanwords, such as “burqa” and “cinq”, along with “qwerty”. I had to add all of those examples to the list of words that the spellchecker should ignore, which indicates how rare they are.
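
One caveat: q[^u] needs some character after the “q”, so it cannot match a word that ends in “q”, such as “cinq”. A possible refinement, sketched here on a made-up vector, also accepts the end of the string after “q”.

```r
library(stringr)

# Made-up sample containing the loanwords mentioned above.
x <- c("queen", "burqa", "cinq", "qwerty")

# Original pattern: "q" followed by any character other than "u".
original <- str_subset(x, "q[^u]")
original
# [1] "burqa"  "qwerty"

# Refinement: "q" followed by a non-"u" character OR by the end of the string.
refined <- str_subset(x, "q([^u]|$)")
refined
# [1] "burqa"  "cinq"   "qwerty"
```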

4. Write a regular expression that matches a word if it’s probably written in British English, not American English.

In the general case, this is hard and could require a dictionary. However, a few heuristics account for some common cases. British English tends to use the following:

“ou” instead of “o”
“ae” and “oe” instead of “a” and “o”
words ending in “ise” instead of “ize”
words ending in “yse” instead of “yze”

The regex ou|ise$|ae|oe|yse$ would match these.
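
Because this is only a heuristic, it also flags words that are spelled the same way in both varieties. The sketch below applies the regex to a small made-up vector to illustrate both the hits and the false positives.

```r
library(stringr)

british <- "ou|ise$|ae|oe|yse$"

# Made-up sample: three British spellings, one American spelling,
# and two words that are spelled identically in both varieties.
x <- c("colour", "realise", "analyse", "color", "house", "shoe")

flagged <- str_subset(x, british)
flagged
# [1] "colour"  "realise" "analyse" "house"   "shoe"
```

“house” and “shoe” are flagged even though they are not distinctively British, which is why the pattern can only say a word is probably British.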

There are other spelling differences between American and British English, but they do not follow patterns that regular expressions can capture. Handling them would require a dictionary that lists the differing spellings word by word.

5. Create a regular expression that will match telephone numbers as commonly written in your country.

For the United States, phone numbers have a format like 123-456-7890.

x <- c("123-456-7890", "1235-2351")
str_view(x, "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d")

or, equivalently, with an explicit character class for the digits:

str_view(x, "[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

This regular expression can be simplified with the {m,n} repetition quantifier introduced in the next section:

str_view(x, "\\d{3}-\\d{3}-\\d{4}")

This answer could be improved and expanded. The pattern checks only the format: it does not reject numbers that are invalid because of an unassigned area code or special numbers like 911, and it does not handle extensions. See the Wikipedia page for the North American Numbering Plan (https://en.wikipedia.org/wiki/North_American_Numbering_Plan) for more information on the complexities of US phone numbers, and the Stack Overflow question at https://stackoverflow.com/questions/123559/a-comprehensive-regex-for-phone-number-validation for a discussion of using a regex for phone number validation.
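
As one possible expansion, the pattern could tolerate a few other common ways of writing the same number. The sketch below is an assumption about which formats are worth accepting, not a complete validator; the name us_phone and the sample numbers are made up.

```r
library(stringr)

# Hypothetical, looser pattern: optional parentheses around the area code and
# an optional "-", "." or space between the groups. The ^ and $ anchors make
# it test the whole string. It checks the format only, and it still accepts
# unbalanced input such as "(123-456-7890".
us_phone <- "^\\(?\\d{3}\\)?[\\-. ]?\\d{3}[\\-. ]?\\d{4}$"

x <- c("123-456-7890", "(123) 456-7890", "123.456.7890", "1235-2351")
result <- str_detect(x, us_phone)
result
# [1]  TRUE  TRUE  TRUE FALSE
```

Using str_detect() rather than str_view() returns a logical vector, which is easier to use for filtering or validation.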
