We are already familiar with regular expressions (RegEx) from web scraping: they are powerful tools for pattern matching and text manipulation. Regular expressions can be used to extract specific information from HTML or any other text data. The exercises below demonstrate the application of RegEx in different scenarios.
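As a small illustration of the extraction use case, the sketch below pulls a link target out of an HTML fragment. The fragment and the variable names `html` and `link` are invented for this example; `str_match()` is the stringr function that returns capture groups.

```r
library(stringr)

# Hypothetical HTML fragment, invented for illustration
html <- '<a href="https://example.com/page">Example</a>'

# Capture everything between the quotes that follow href=
# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group
link <- str_match(html, 'href="([^"]+)"')[, 2]
link
```

For real pages a dedicated HTML parser is usually more robust; a pattern like this is only reasonable for short, predictable fragments.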
A vector of strings is given:
library(stringr)  # provides str_view()
vector <- c("emoticon", ":)", "symbol", "$^$")
writeLines(vector)
## emoticon
## :)
## symbol
## $^$
# Use the function str_view() to find in the vector:
# a) any string of 3 characters with the letter "o" in the middle
str_view(vector, '.o.')
## [1] │ e<mot>i<con>
## [3] │ sym<bol>
# b) expression "emoticon"
str_view(vector, "^emoticon$")
## [1] │ <emoticon>
# c) expression ":)"
str_view(vector, "^:\\)$")  # ":" is not a metacharacter, only ")" needs escaping
## [2] │ <:)>
# d) expression "$^$"
str_view(vector, "^\\$\\^\\$$")
## [4] │ <$^$>
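Escaping every metacharacter by hand is easy to get wrong. As an alternative sketch, stringr's `fixed()` treats the pattern as a literal string, so no backslashes are needed. Note that `fixed()` matches the literal anywhere in the string; it is not anchored like the `^...$` patterns above. The names `is_dollar` and `is_smiley` are chosen for this example only.

```r
library(stringr)

vector <- c("emoticon", ":)", "symbol", "$^$")

# fixed() switches off regex interpretation, so "$", "^" and ")"
# are compared as ordinary characters
is_dollar <- str_detect(vector, fixed("$^$"))
is_smiley <- str_detect(vector, fixed(":)"))

vector[is_dollar]  # "$^$"
vector[is_smiley]  # ":)"
```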
A corpus of 980 words is given: stringr::words
# Use the function str_view() and find in the corpus:
# a) all words containing the expression "yes" (add the parameter match = TRUE)
str_view(stringr::words, "yes", match = TRUE)
## [976] │ <yes>
## [977] │ <yes>terday
# b) all words starting with "w"
str_view(stringr::words, "^w")
## [922] │ <w>age
## [923] │ <w>ait
## [924] │ <w>alk
## [925] │ <w>all
## [926] │ <w>ant
## [927] │ <w>ar
## [928] │ <w>arm
## [929] │ <w>ash
## [930] │ <w>aste
## [931] │ <w>atch
## [932] │ <w>ater
## [933] │ <w>ay
## [934] │ <w>e
## [935] │ <w>ear
## [936] │ <w>ednesday
## [937] │ <w>ee
## [938] │ <w>eek
## [939] │ <w>eigh
## [940] │ <w>elcome
## [941] │ <w>ell
## ... and 33 more
# c) all words ending with "x"
str_view(stringr::words, "x$")
## [108] │ bo<x>
## [747] │ se<x>
## [772] │ si<x>
## [841] │ ta<x>
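`str_view()` is meant for visual inspection and truncates long results. When the matches are needed as data, a possible approach is `str_subset()` (returns the matching strings) or `str_detect()` (returns a logical vector that can be counted). The sketch below reuses the anchored patterns from this exercise; the variable names `ends_x` and `n_w` are assumptions made for the example.

```r
library(stringr)

# Words ending in "x", returned as a character vector
ends_x <- str_subset(stringr::words, "x$")
ends_x

# Number of words starting with "w": str_detect() gives TRUE/FALSE per word,
# and sum() counts the TRUE values
n_w <- sum(str_detect(stringr::words, "^w"))
n_w
```

With the corpus shown above, `ends_x` should contain the same four words that `str_view()` highlighted, and `n_w` should equal the 20 displayed matches plus the 33 that were cut off.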
A corpus of 980 words is given: stringr::words
# Use the function str_view() and find in the corpus:
# a) all words starting with a vowel
str_view(stringr::words, '^[aeiouAEIOU]')
## [1] │ <a>
## [2] │ <a>ble
## [3] │ <a>bout
## [4] │ <a>bsolute
## [5] │ <a>ccept
## [6] │ <a>ccount
## [7] │ <a>chieve
## [8] │ <a>cross
## [9] │ <a>ct
## [10] │ <a>ctive
## [11] │ <a>ctual
## [12] │ <a>dd
## [13] │ <a>ddress
## [14] │ <a>dmit
## [15] │ <a>dvertise
## [16] │ <a>ffect
## [17] │ <a>fford
## [18] │ <a>fter
## [19] │ <a>fternoon
## [20] │ <a>gain
## ... and 155 more
# b) all words that start with a consonant
str_view(stringr::words, '^[^aeiouAEIOU]')
## [66] │ <b>aby
## [67] │ <b>ack
## [68] │ <b>ad
## [69] │ <b>ag
## [70] │ <b>alance
## [71] │ <b>all
## [72] │ <b>ank
## [73] │ <b>ar
## [74] │ <b>ase
## [75] │ <b>asis
## [76] │ <b>e
## [77] │ <b>ear
## [78] │ <b>eat
## [79] │ <b>eauty
## [80] │ <b>ecause
## [81] │ <b>ecome
## [82] │ <b>ed
## [83] │ <b>efore
## [84] │ <b>egin
## [85] │ <b>ehind
## ... and 785 more
# c) all words ending with "ing" or "ise"
str_view(stringr::words, '(ing|ise)$')
## [15] │ advert<ise>
## [113] │ br<ing>
## [251] │ dur<ing>
## [280] │ even<ing>
## [288] │ exerc<ise>
## [448] │ k<ing>
## [512] │ mean<ing>
## [533] │ morn<ing>
## [588] │ otherw<ise>
## [637] │ pract<ise>
## [674] │ ra<ise>
## [681] │ real<ise>
## [709] │ r<ing>
## [710] │ r<ise>
## [765] │ s<ing>
## [834] │ surpr<ise>
## [860] │ th<ing>
# d) all words ending with "ed" but not with "eed"
str_view(stringr::words, '[^e]ed$')
## [82] │ <bed>
## [410] │ hund<red>
## [690] │ <red>
# -------------------------------------------------#
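One subtlety in part d): `[^e]` consumes a character, so `[^e]ed$` only matches when something precedes "ed". A negative lookbehind, which the ICU engine behind stringr supports, merely asserts that the preceding character is not "e". The sketch below compares both on a small vector `toy` invented for illustration; on `stringr::words` the difference does not show, because the corpus output above contains no bare "ed".

```r
library(stringr)

# Toy input, invented for illustration
toy <- c("bed", "red", "ed", "need", "hundred")

# [^e] must match one character, so the bare word "ed" is skipped
with_class <- str_subset(toy, "[^e]ed$")
with_class       # "bed" "red" "hundred"

# (?<!e) checks the previous position without consuming it,
# so "ed" at the very start of a string also qualifies
with_lookbehind <- str_subset(toy, "(?<!e)ed$")
with_lookbehind  # "bed" "red" "ed" "hundred"
```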