text_strings <- c("We have to extract these numbers 12, 47, 48",
"The integers numbers are also interestings: 189 2036 314",
"','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456",
"We like to to offer you 7890$ per month in order to complete this task... we are joking",
"You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits.",
"Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life.",
"you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar",
"Writing 1 example is not funny, please consider that 66% is validation+testing",
"You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]",
"Who loves arrays more than me?",
"{366,78,90,5}Yes, there are only 4 numbers inside",
"Integers are fine but sometimes you like 99 cents after the 99 dollars",
"100 euro are better than 99 euro",
"I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]",
"Ok ok 1 2 3 4 5 and the last one is 6",
"33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando",
"hetot hhats",
"hetlltstf")
grep("(\\b|[^t])(t{2})(\\b|[^t])", text_strings, value = TRUE)
## [1] "100 euro are better than 99 euro"
## [2] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
The core of this pattern is t{2}, which matches two consecutive t’s.
On its own, however, it would also match inside a run of three or more
t’s. Therefore, the expression (\\b|[^t]) is added before and after
it. A plain [^t] on both sides would not have been sufficient: it
would correctly reject runs of more than two t’s, but because [^t]
must consume a character, it would also miss a double t at the very
start or end of a string. A string ending in Bennett, for example,
would be excluded, because no character follows the double t. With the
adapted expression, the double t must be preceded and followed either
by a character that is not a t OR by a word boundary (which is
indicated by the \\b). The or is expressed by the operator
|.
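The effect of the boundary alternative can be checked on a few made-up strings. This is a minimal sketch; the vector demo and its contents are illustrative assumptions and not part of the exercise data.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("Bennett", "butter", "atttack")

# Plain [^t] on both sides: a double t at the end of the string is missed
plain <- grepl("[^t]t{2}[^t]", demo)

# With the word-boundary alternative used in the pattern above
bounded <- grepl("(\\b|[^t])t{2}(\\b|[^t])", demo)

print(data.frame(demo, plain, bounded))
```

Only the second variant accepts "Bennett", while both reject the triple t in "atttack".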
grep("(t\\w*t)|(e\\w*e)", text_strings, value = TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [11] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [12] "100 euro are better than 99 euro"
## [13] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [14] "hetot hhats"
## [15] "hetlltstf"
grep("(\\b|[^t])\\w*(t\\w*t)\\w*(\\b|[^t])", text_strings, value = TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [5] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [6] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [7] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [8] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [9] "100 euro are better than 99 euro"
## [10] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [11] "hetot hhats"
## [12] "hetlltstf"
The aim is to find words that contain exactly two t’s; the pattern
used for this is t\\w*t. It requires a t at the beginning and at the
end of the match and puts a \\w* in between. The \\w stands for any
word character (letters, digits, and the underscore) and the * stands
for zero or more repetitions. This means that there can be 0 or more
word characters between the two t’s. Since \\w only matches word
characters and not space characters, there is no need to worry that a
string gets selected that has two t’s but in two different words. To
get the above pattern to select words with two t’s or two e’s, the or
operator is used (|) and the equivalent pattern for the e is
added.
Comment: Here also words with more than two t’s/e’s would be selected. Please adapt.
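One possible adaptation, shown only as a sketch, is to forbid further t’s everywhere else in the word. The pattern below uses the class [^t\\W] (any word character except a lower case t), which requires perl = TRUE. The vector words is a made-up example, capital T is not counted, and the analogous pattern for e would replace each t by e.

```r
# Hypothetical test words (not taken from text_strings)
words <- c("extract", "trotterellando", "hat", "that cat", "ttt")

# [^t\\W] = a word character that is not a lower case t
exactly_two_t <- "\\b[^t\\W]*t[^t\\W]*t[^t\\W]*\\b"

hits <- grepl(exactly_two_t, words, perl = TRUE)
print(setNames(hits, words))
```

Because every word character outside the two literal t’s is barred from being a t, and the match has to span the whole word from \\b to \\b, words with three or more t’s no longer qualify.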
grep("([^0-9]|\\b)[0-9]$", text_strings, value = TRUE)
## [1] "Ok ok 1 2 3 4 5 and the last one is 6"
With 0-9 all digits are covered; alternatively, one could use \\d.
The [0-9]$ ensures that there is a digit at the end of the string. The
([^0-9]|\\b) before it ensures that the character before the last
digit is not another digit, or that the digit sits at a word boundary
(for example, at the very start of the string). The $ operator anchors
the match to the end of the string, and the ^ as the first character
inside square brackets negates what is in the brackets. As in task 1,
the alternative with \\b covers the edge case where the string is just
a single digit.
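The edge cases described above can be tried on a few made-up strings. This sketch also writes the same idea with the \\d and \\D shorthands; the vector demo is an assumption for illustration only.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("7", "a7", "17", "x 6")

# Pattern from above, written with the [0-9] range
with_range <- grepl("([^0-9]|\\b)[0-9]$", demo)

# Same idea written with the \\d and \\D shorthands
with_shorthand <- grepl("(\\D|\\b)\\d$", demo)

print(data.frame(demo, with_range, with_shorthand))
```

The single digit "7" is accepted through the \\b alternative, whereas "17" is rejected because its last digit is preceded by another digit.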
grep("([^0-9]|\\b)[0-9]{2}$", text_strings, value = TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
This code follows the same logic; only the {2} was added. It ensures
that the string ends in exactly two digits rather than one.
grep("[A-Z](\\S|\\s)*[A-Z]", text_strings, value = TRUE)
## [1] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
The most important part of this expression is [A-Z], which matches
any capital letter. Because the task is to find all strings that have
more than one capital letter, it appears twice. The two occurrences
are separated by (\\S|\\s), which allows any character, no matter
whether it is a non-space or a space character. Adding the * means
that any number of characters, including none, may lie between the
two capital letters.
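For strings without line breaks, the alternation (\\S|\\s) can presumably be shortened to a dot. The comparison below is a small sketch on made-up strings; demo is a hypothetical vector.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("Hello World", "Hello world", "A", "aBcD")

# Pattern from above: any space or non-space character in between
long_form <- grepl("[A-Z](\\S|\\s)*[A-Z]", demo)

# Shorter spelling: the dot stands for any character
short_form <- grepl("[A-Z].*[A-Z]", demo)

print(data.frame(demo, long_form, short_form))
```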
grep("nu", text_strings, value = TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [5] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [6] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
The literal pattern nu matches every string that contains these two
letters in sequence.
grep("[^[:alnum:]\\s]", text_strings, value = TRUE, perl = TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [8] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [9] "Who loves arrays more than me?"
## [10] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [11] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [12] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
This pattern consists of two main elements: [:alnum:] and
\\s. Both of these elements are inside square brackets with the
negation sign ^ before them. The pattern therefore matches every
string that contains at least one character that is neither
alphanumeric nor whitespace.
grep("[^[:alnum:]\\s]", text_strings, value = TRUE, perl = TRUE, invert = TRUE)
## [1] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [2] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [3] "100 euro are better than 99 euro"
## [4] "Ok ok 1 2 3 4 5 and the last one is 6"
## [5] "hetot hhats"
## [6] "hetlltstf"
For this we used the same statement as above but added the argument
invert = TRUE. This inverts the selection: whereas task 7 gave us all
strings containing a special character, we now get all strings that
do not contain a special character.
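The same selection can also be written by negating grepl(). The following sketch compares both spellings on a made-up vector; demo and the variable names are assumptions for illustration.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("plain text 1", "with, comma", "ok")

special <- "[^[:alnum:]\\s]"

# Variant used above: let grep() invert the selection
via_invert <- grep(special, demo, value = TRUE, perl = TRUE, invert = TRUE)

# Equivalent spelling: negate the logical vector returned by grepl()
via_grepl <- demo[!grepl(special, demo, perl = TRUE)]

print(via_invert)
```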
grep("[[:punct:]]$", text_strings, value = TRUE, perl = TRUE)
## [1] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [2] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [3] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [4] "Who loves arrays more than me?"
## [5] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
Here the dollar sign is used to refer to the end of the string. The
POSIX character class [[:punct:]] matches any punctuation mark.
grep("^[a-z]", text_strings, value = TRUE)
## [1] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [2] "hetot hhats"
## [3] "hetlltstf"
Instead of the $ sign for indicating the end of a string
we can use the ^ sign to indicate the start of a string. In
this case the ^ is followed by [a-z]. This
means that the beginning of a string must be a letter that is in the
range of lowercase a to lowercase z.
sub("([A-Za-z0-9])$", "\\1.", text_strings)
## [1] "We have to extract these numbers 12, 47, 48."
## [2] "The integers numbers are also interestings: 189 2036 314."
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456."
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking."
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar."
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing."
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside."
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars."
## [13] "100 euro are better than 99 euro."
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6."
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando."
## [17] "hetot hhats."
## [18] "hetlltstf."
Here we first check whether the last character of a string is
alphanumeric. This is done by [A-Za-z0-9]$. The A-Z covers all upper
case letters, the a-z covers all lower case letters, and the 0-9
covers all digits. The dollar sign after the expression in square
brackets indicates that the character has to be at the end of the
string. Furthermore, we put parentheses around it so we can treat it
as a group, in this case group 1.
In the replacement argument of our function we wrote \\1.. This
argument is built up of two elements. The first is the reference back
to the group (\\1). It makes sure that the last character of a string
that fits the pattern is replaced by itself (i.e., it is kept). The
additional . is the punctuation mark that is added. This mark could
also be exchanged for any other punctuation mark, although it might
have to be escaped with backslashes.
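As a small sketch of swapping the punctuation mark, the call below appends an exclamation mark instead of a dot. The vector demo is made up, and the [[:alnum:]] variant is an alternative spelling that should behave the same for these ASCII strings.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("done", "done.", "x9")

# Same pattern as above, but appending "!" instead of "."
with_bang <- sub("([A-Za-z0-9])$", "\\1!", demo)

# Alternative spelling with the POSIX class [[:alnum:]]
with_class <- sub("([[:alnum:]])$", "\\1!", demo)

print(with_bang)
```

The string that already ends in a punctuation mark is left untouched.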
gsub("dollars", "\\$", text_strings)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 $"
## [13] "100 euro are better than 99 euro"
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "hetot hhats"
## [18] "hetlltstf"
Here the gsub() function is used in case the word
dollars is used more than once in a string. Otherwise, this is
a very simple replacement of the word dollars by the dollar
sign.
gsub("euro", "€", text_strings)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [13] "100 € are better than 99 €"
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "hetot hhats"
## [18] "hetlltstf"
Here the same logic is applied as above. However, no escape signs
(\\) had to be used, since the € sign is not a special
character.
gsub("\\b(\\w+)\\b\\s\\b\\1\\b", "\\1", text_strings)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [13] "100 euro are better than 99 euro"
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "hetot hhats"
## [18] "hetlltstf"
The main parts of the pattern are the (\\w+) and the
\\1. The (\\w+) creates a group that contains
a word of at least one character. The \\1 is a reference back
to this group, which means that the pattern matches when the word is
repeated. The \\b at the beginning and end make sure that
only full words that occur twice are collapsed and not words that
contain the same characters by coincidence (e.g., in “There is a hat at”
the second “at” is not deleted, because it does not fully match
“hat”). Lastly, the \\s in the middle requires a space between the
two words, so that words like “coco” are
not affected. The \\1 in the replacement argument again
refers back to the group that was created. Note that the match is case
sensitive, which is why “You you” and “Ok ok” remain unchanged.
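The pattern above compares the two words case sensitively, so a pair such as “You you” is left alone. If case should be ignored, which is an assumption about the task rather than something stated in it, one option is ignore.case = TRUE together with perl = TRUE. The vector demo is made up for this sketch.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("We like to to offer", "You you are a genius", "There is a hat at")

# Pattern from above: the repeated word must match exactly
case_sensitive <- gsub("\\b(\\w+)\\b\\s\\b\\1\\b", "\\1", demo)

# Sketch of a case-insensitive variant (the first spelling is kept)
case_insensitive <- gsub("\\b(\\w+)\\s\\1\\b", "\\1", demo,
                         ignore.case = TRUE, perl = TRUE)

print(data.frame(demo, case_sensitive, case_insensitive))
```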
x <- gsub("(\\d)(\\s)(\\d)", "\\1\\, \\3", text_strings)
gsub("(\\d)(\\s)(\\d)", "\\1\\, \\3", x)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189, 2036, 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [13] "100 euro are better than 99 euro"
## [14] "I like to give you 1000 numbers now: 12, 3, 56, 21, 67, and more, [45,67,7]"
## [15] "Ok ok 1, 2, 3, 4, 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "hetot hhats"
## [18] "hetlltstf"
The pattern used to find digits that are separated by a
space is (\\d)(\\s)(\\d). This pattern consists of three
groups. The first one (\\d) matches the first digit, the second one
(\\s) matches the space between the digits, and the third one (\\d)
matches the second digit. In the replacement argument the first and
the third group are called, with the desired comma and space between
them. The issue with this code arises when several single digits are
separated by spaces (e.g., 1 2 3 4). Because each match consumes both
of its digits, a single pass finds only two matches (1 2 and 3 4) and
puts a comma and a space only between those (1, 2 3, 4). However, by
running the same code again on the result we get the desired output.
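A single pass is possible if the second digit is only checked and not consumed. The sketch below uses a lookahead, (?=\\d), which requires perl = TRUE; it is an alternative to the two-pass approach, not the method used above, and demo is a made-up vector.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("1 2 3 4", "now: 12 3 56 21 67, and more")

# Two passes with the three-group pattern from above
first_pass <- gsub("(\\d)(\\s)(\\d)", "\\1, \\3", demo)
two_pass   <- gsub("(\\d)(\\s)(\\d)", "\\1, \\3", first_pass)

# One pass: the lookahead checks for a digit without consuming it
one_pass <- gsub("(\\d)\\s(?=\\d)", "\\1, ", demo, perl = TRUE)

print(data.frame(first_pass, two_pass, one_pass))
```

Since the digit after the space stays available, it can serve as the first digit of the next match.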
gsub("(\\d),(\\d)", "\\1, \\2", text_strings)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125, 789, 1450 and also these 564, 90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3, 45, 67, 900, 1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366, 78, 90, 5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [13] "100 euro are better than 99 euro"
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45, 67, 7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "hetot hhats"
## [18] "hetlltstf"
Once again, two groups are created so that they can be replaced by
themselves. In this case the groups are the digit before the comma and
the digit after the comma. Those two groups are separated by a comma,
hence we have the expression (\\d),(\\d). In the
replacement argument we then call the groups individually and put a
comma and a space between them.
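The grouped pattern works for the exercise data because the numbers between commas mostly have several digits. For lists of single digits, a sketch with lookbehind and lookahead (perl = TRUE) avoids consuming the digits at all. The vector demo is a made-up example.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("1,2,3", "[45,67,7]")

# Pattern from above: each match consumes one digit on either side
grouped <- gsub("(\\d),(\\d)", "\\1, \\2", demo)

# Lookaround variant: only the comma itself is matched and replaced
lookaround <- gsub("(?<=\\d),(?=\\d)", ", ", demo, perl = TRUE)

print(data.frame(demo, grouped, lookaround))
```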
gsub("\\b([a-z])(\\w*)", "\\U\\1\\E\\2", text_strings, perl=TRUE)
## [1] "We Have To Extract These Numbers 12, 47, 48"
## [2] "The Integers Numbers Are Also Interestings: 189 2036 314"
## [3] "','Is A Separator, So Please Extract These Numbers 125,789,1450 And Also These 564,90456"
## [4] "We Like To To Offer You 7890$ Per Month In Order To Complete This Task... We Are Joking"
## [5] "You Are Going To Learn 3 Things, The First One Is Not To Extract, And 2 And 3 Are Simply Digits."
## [6] "Have Fun With Our Mighty Test, You Are Going To Support Science, Progress, Mankind Wellness And You Are Going To Waste 30 Or 60 Minutes Of Your Life."
## [7] "You Can Also Extract Exotic Stuff Like A456 Gb67 And 45678911ghth Dollar"
## [8] "Writing 1 Example Is Not Funny, Please Consider That 66% Is Validation+Testing"
## [9] "You You Are A Genius, I Think That You Like Arrays A LOT, [3,45,67,900,1974]"
## [10] "Who Loves Arrays More Than Me?"
## [11] "{366,78,90,5}Yes, There Are Only 4 Numbers Inside"
## [12] "Integers Are Fine But Sometimes You Like 99 Cents After The 99 Dollars"
## [13] "100 Euro Are Better Than 99 Euro"
## [14] "I Like To Give You 1000 Numbers Now: 12 3 56 21 67, And More, [45,67,7]"
## [15] "Ok Ok 1 2 3 4 5 And The Last One Is 6"
## [16] "33 Trentini Entrarono A Trento, Tutti E 33 Di Tratto In Tratto Trotterellando"
## [17] "Hetot Hhats"
## [18] "Hetlltstf"
The following pattern is used to find words that begin with a lower
case letter: \\b([a-z])(\\w*). At the start of this pattern
there is a \\b, indicating the start of a word.
After that the range [a-z] matches the lower case starting letter.
Then, \\w* is used to match all the characters that could follow the
starting letter. The * is there so that words of only one letter
(e.g., “a”) are matched as well. Furthermore, the starting letter of
the word, as well as the following characters, were put in two groups.
This is so we can modify them and replace them by themselves later.
The only modification was to group 1, i.e., the starting letter. We
put the Perl-style escapes \\U and \\E around it to indicate that it
should be in upper case. We did no modification to group 2.
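Since group 2 is copied back unchanged, a shorter variant that matches only the first letter should give the same result. This is a sketch on made-up strings; demo is a hypothetical vector.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("hello world 9lives", "I like arrays")

# Pattern from above: first letter and rest of the word in two groups
two_groups <- gsub("\\b([a-z])(\\w*)", "\\U\\1\\E\\2", demo, perl = TRUE)

# Shorter variant: match and upper-case only the first letter
one_group <- gsub("\\b([a-z])", "\\U\\1", demo, perl = TRUE)

print(data.frame(demo, two_groups, one_group))
```

In "9lives" the l does not follow a word boundary, so it stays in lower case in both variants.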
gsub("(\\.)$", "\\1\\$", text_strings, perl=TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits.$"
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life.$"
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [13] "100 euro are better than 99 euro"
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "hetot hhats"
## [18] "hetlltstf"
Here we find the strings that end with a dot by adding the
$ sign after the escaped dot \\.. We replace
this by the dot itself followed by an escaped dollar sign.
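The two roles of the dollar sign can be seen side by side in a small sketch: in the pattern it anchors the match to the end of the string, while in the replacement it is an ordinary character. The vector demo is made up, and the second call is an alternative spelling without group or escape.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("ends with a dot.", "no dot at the end", "dot. in the middle")

# Call from above: grouped dot, escaped dollar sign in the replacement
escaped <- gsub("(\\.)$", "\\1\\$", demo, perl = TRUE)

# Alternative spelling: literal dot and dollar sign in the replacement
unescaped <- sub("\\.$", ".$", demo)

print(data.frame(demo, escaped, unescaped))
```

A dot in the middle of a string is not affected, because the $ in the pattern only accepts the end of the string.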
gsub("(\\.)$", "\\^\\1", text_strings, perl=TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits^."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life^."
## [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [13] "100 euro are better than 99 euro"
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "hetot hhats"
## [18] "hetlltstf"
Here we use the same pattern as above. Having grouped the dot allows
us to rearrange it in our replacement pattern, so we
can simply put the escaped ^ before our call of group one,
\\1.
gsub("^\\b([a-z])", "\\^\\1", text_strings, perl=TRUE)
## [1] "We have to extract these numbers 12, 47, 48"
## [2] "The integers numbers are also interestings: 189 2036 314"
## [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"
## [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
## [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
## [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [7] "^you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"
## [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
## [10] "Who loves arrays more than me?"
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
## [13] "100 euro are better than 99 euro"
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [17] "^hetot hhats"
## [18] "^hetlltstf"
Our search pattern looks like this: ^\\b([a-z]). The
caret in the search pattern makes sure that we are searching at the
start of a string. The additional \\b makes sure that we
are looking at the start of a word, although it is not strictly
necessary. The grouped range from a to z makes sure that we only match
a lower case letter as the first character. Grouping this first letter
allows us to call it later on. The replacement pattern consists of the
escaped caret and the call of group 1.
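That the \\b is redundant here can be checked with a short sketch comparing the pattern with and without it. The vector demo is a made-up example.

```r
# Hypothetical test strings (not taken from text_strings)
demo <- c("you can", "You can", "9 lives", " leading space")

# Call from above, including the word boundary
with_boundary <- gsub("^\\b([a-z])", "\\^\\1", demo, perl = TRUE)

# Same idea without \\b: ^ followed by [a-z] already fixes the position
without_boundary <- sub("^([a-z])", "^\\1", demo)

print(data.frame(demo, with_boundary, without_boundary))
```

Only the string that starts with a lower case letter receives the caret.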