Exercise 1

Import the Dataset

text_strings <- c("We have to extract these numbers 12, 47, 48",
                  "The integers numbers are also interestings: 189 2036 314",
                  "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456", 
                  "We like to to offer you 7890$ per month in order to complete this task... we are joking", 
                  "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits.", 
                  "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life.", 
                  "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar", 
                  "Writing 1 example is not funny, please consider that 66% is validation+testing", 
                  "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]",
                  "Who loves arrays more than me?", 
                  "{366,78,90,5}Yes, there are only 4 numbers inside",
                  "Integers are fine but sometimes you like 99 cents after the 99 dollars",
                  "100 euro are better than 99 euro",
                  "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]", 
                  "Ok ok 1 2 3 4 5 and the last one is 6", 
                  "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando",
                  "hetot hhats",
                  "hetlltstf")

Selection:

1. Select all the strings that have words that contain exactly two consecutive t’s.

grep("(\\b|[^t])(t{2})(\\b|[^t])", text_strings, value = TRUE)

## [1] "100 euro are better than 99 euro"                                             
## [2] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"

The core of this pattern is t{2}, which matches two consecutive t’s. On its own, however, it would also match inside words with three or more consecutive t’s. Therefore, the expression (\\b|[^t]) is added before and after it. A plain [^t] on both sides would not have been sufficient: it would correctly exclude runs of more than two t’s, but it would also exclude words that start or end with a double t. A word like Bennett would be missed, because there is no character after the double t that [^t] could match. With the alternation, the double t must be preceded and followed either by a character other than t OR by a word boundary (indicated by \\b). The or is expressed by the operator |.
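The difference between the two variants can be checked on a few made-up words. This is only a sketch: the vector words is hypothetical and not part of the exercise data, and perl = TRUE is an added assumption so that \\b follows PCRE semantics.

```r
# Hypothetical test words (not part of the exercise data).
words <- c("Bennett", "better", "ttt", "attic", "bottttle", "tt", "cat")

# [^t] on both sides: a double t at the start or end of a word is missed.
strict <- grep("[^t]t{2}[^t]", words, value = TRUE, perl = TRUE)

# (\\b|[^t]) on both sides: a word boundary next to the double t is also accepted.
relaxed <- grep("(\\b|[^t])t{2}(\\b|[^t])", words, value = TRUE, perl = TRUE)

print(strict)   # "better" "attic"
print(relaxed)  # "Bennett" "better" "attic" "tt"
```

Words with three or four consecutive t’s ("ttt", "bottttle") are rejected by both variants; only the variant with \\b keeps "Bennett" and "tt".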

2. Select all the strings that have words that contain exactly two t’s (not necessarily consecutive) or exactly two e’s.

grep("(t\\w*t)|(e\\w*e)", text_strings, value = TRUE)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                              
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                             
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                    
## [11] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                               
## [12] "100 euro are better than 99 euro"                                                                                                                     
## [13] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [14] "hetot hhats"                                                                                                                                          
## [15] "hetlltstf"

grep("(\\b|[^t])\\w*(t\\w*t)\\w*(\\b|[^t])", text_strings, value = TRUE)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [5] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [6] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                             
##  [7] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [8] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
##  [9] "100 euro are better than 99 euro"                                                                                                                     
## [10] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [11] "hetot hhats"                                                                                                                                          
## [12] "hetlltstf"

A first approach to finding words with two t’s is the pattern t\\w*t. It requires a t at the beginning and at the end of the match, with \\w* in between. The \\w stands for any word character (letters, digits, and the underscore) and the * means zero or more repetitions, so any number of word characters may lie between the two t’s. Since \\w does not match whitespace, the two t’s cannot come from two different words. To select words with two t’s or two e’s, the or operator (|) is used and the corresponding pattern for e is added. The second call handles only the t case and wraps the pattern in (\\b|[^t])\\w* and \\w*(\\b|[^t]) to try to rule out additional t’s, although \\w* can itself match a t.

Comment: Here also words with more than two t’s/e’s would be selected. Please adapt.
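One possible adaptation, given here as a sketch and not as the intended solution, is to build the whole word from character classes that exclude the counted letter. The vector words is hypothetical, perl = TRUE is assumed so that \\W can be used inside a bracket expression, and the match is case sensitive (an uppercase T or E is not counted).

```r
# Hypothetical test words (not part of the exercise data).
words <- c("test", "tattoo", "tent", "at", "street", "tee", "eleven")

# [^t\\W] is a word character other than t, so the whole word may contain
# exactly two t's: <non-t>* t <non-t>* t <non-t>*
two_t <- "\\b[^t\\W]*t[^t\\W]*t[^t\\W]*\\b"
# Same construction for the letter e.
two_e <- "\\b[^e\\W]*e[^e\\W]*e[^e\\W]*\\b"
pattern <- paste0("(", two_t, ")|(", two_e, ")")

exact <- grep(pattern, words, value = TRUE, perl = TRUE)
loose <- grep("(t\\w*t)|(e\\w*e)", words, value = TRUE)

print(exact)  # "test" "tent" "street" "tee"
print(loose)  # "test" "tattoo" "tent" "street" "tee" "eleven"
```

The original pattern also accepts "tattoo" (three t’s) and "eleven" (three e’s), while the adapted one rejects them.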

3. Select all the strings that end in exactly one digit.

grep("([^0-9]|\\b)[0-9]$", text_strings, value = TRUE)

## [1] "Ok ok 1 2 3 4 5 and the last one is 6"

With [0-9] all digits are covered; alternatively one could use \\d. The [0-9]$ ensures that the string ends with a digit, because the $ anchors the match to the end of the string. The ([^0-9]|\\b) before it ensures that the character preceding the last digit is not another digit, or that there is a word boundary at that position. The [^...] form (the ^ must be inside the brackets) negates the characters listed in the brackets. As in task 1, the alternative with \\b covers an edge case: a string that consists of a single digit only and therefore has no preceding character.
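The edge case can be illustrated on a small hypothetical vector. As an alternative that is assumed here and not taken from the exercise, the start-of-string anchor ^ can stand in for \\b; perl = TRUE is added to the first call only so that \\b follows PCRE semantics.

```r
# Hypothetical test strings (not part of the exercise data).
ends <- c("6", "is 6", "a6", "48", "no digit")

# Variant from the text: a non-digit or a word boundary before the last digit.
with_boundary <- grep("([^0-9]|\\b)[0-9]$", ends, value = TRUE, perl = TRUE)

# Alternative: a non-digit or the start of the string before the last digit.
with_anchor <- grep("(^|[^0-9])[0-9]$", ends, value = TRUE)

print(with_boundary)  # "6" "is 6" "a6"
print(with_anchor)    # "6" "is 6" "a6"
```

Both variants accept the single-digit string "6" and reject "48", which ends in two digits.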

4. Select all the strings that end in two digits.

grep("([^0-9]|\\b)[0-9]{2}$", text_strings, value = TRUE)

## [1] "We have to extract these numbers 12, 47, 48"

This code follows the same logic; only the {2} was added, which requires two digits instead of one.

5. Select all the strings that have more than one capital letter.

grep("[A-Z](\\S|\\s)*[A-Z]", text_strings, value = TRUE)

## [1] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"

The most important part of this expression is [A-Z], which matches any capital letter. Because the task is to find all strings that have more than one capital letter, it appears twice. The two occurrences are separated by (\\S|\\s), which allows any character, whether it is a space or a non-space character. The * means that zero or more characters may lie between the two capital letters.
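Two equivalent ways of expressing the same condition are sketched below on a hypothetical vector: using . in place of (\\S|\\s), and counting the capital letters directly. Both are assumptions added for illustration and are not part of the original solution.

```r
# Hypothetical test strings (not part of the exercise data).
samples <- c("Only one capital", "Two Capitals here", "ABC", "none at all", "A b C")

# Pattern from the text: any characters between two capital letters.
original <- grep("[A-Z](\\S|\\s)*[A-Z]", samples, value = TRUE)

# Shorter form: . matches any character in these single-line strings.
shorter <- grep("[A-Z].*[A-Z]", samples, value = TRUE)

# Counting approach: extract every capital letter and count the matches.
counts <- lengths(regmatches(samples, gregexpr("[A-Z]", samples)))
counted <- samples[counts > 1]

print(original)  # "Two Capitals here" "ABC" "A b C"
print(counts)    # 1 2 3 0 2
```

The counting approach generalises easily to other thresholds, for example counts == 2 for exactly two capital letters.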

6. Select all the strings that do not have the substring ‘nu’.

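grep("nu", text_strings, value = TRUE, invert = TRUE)

##  [1] "We like to to offer you 7890$ per month in order to complete this task... we are joking"
##  [2] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."
##  [3] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
##  [4] "Writing 1 example is not funny, please consider that 66% is validation+testing"
##  [5] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"
##  [6] "Who loves arrays more than me?"
##  [7] "Integers are fine but sometimes you like 99 cents after the 99 dollars"
##  [8] "100 euro are better than 99 euro"
##  [9] "Ok ok 1 2 3 4 5 and the last one is 6"
## [10] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"
## [11] "hetot hhats"
## [12] "hetlltstf"

The pattern nu matches the substring itself. Because the task asks for the strings that do not contain it, the argument invert = TRUE is added, so that grep() returns the non-matching strings.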
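Since the task asks for strings without the substring, the match has to be negated. A minimal sketch on a hypothetical vector shows two ways of doing this: invert = TRUE in grep(), or logical negation of grepl().

```r
# Hypothetical test words (not part of the exercise data).
x <- c("numbers", "fun", "minutes", "genius")

# invert = TRUE returns the elements that do NOT match the pattern.
via_invert <- grep("nu", x, value = TRUE, invert = TRUE)

# Equivalent: grepl() gives a logical vector that can be negated.
via_grepl <- x[!grepl("nu", x, fixed = TRUE)]

print(via_invert)  # "fun" "genius"
```

Note that "fun" contains the letters in the order u, n and is therefore kept, whereas "minutes" contains nu and is dropped.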

7. Select all the strings that contain special characters such as: [, €, and so on

grep("[^[:alnum:]\\s]", text_strings, value = TRUE, perl = TRUE)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                              
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [8] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
##  [9] "Who loves arrays more than me?"                                                                                                                       
## [10] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                    
## [11] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                              
## [12] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"

This pattern consists of two main elements, [:alnum:] and \\s, both placed inside a bracket expression that starts with the negation sign ^. The pattern therefore matches any character that is neither alphanumeric nor whitespace, and grep() returns the strings that contain at least one such character.
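The call relies on perl = TRUE so that \\s is interpreted inside the brackets. A variant that uses only POSIX classes, and so does not depend on that argument, is sketched below; it is an assumption added for comparison, and the test vector is hypothetical.

```r
# Hypothetical test strings (not part of the exercise data).
samples <- c("plain text 123", "price: 5", "tab\there", "a_b")

# Pattern from the text: needs perl = TRUE for \\s inside the brackets.
with_perl <- grep("[^[:alnum:]\\s]", samples, value = TRUE, perl = TRUE)

# POSIX-only variant: [:space:] replaces \\s.
with_posix <- grep("[^[:alnum:][:space:]]", samples, value = TRUE)

print(with_perl)   # "price: 5" "a_b"
print(with_posix)  # "price: 5" "a_b"
```

The tab in "tab\there" counts as whitespace and is not treated as a special character, while the underscore in "a_b" is.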

8. Select all the strings that do not contain special characters.

grep("[^[:alnum:]\\s]", text_strings, value = TRUE, perl = TRUE, invert = TRUE)

## [1] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [2] "Integers are fine but sometimes you like 99 cents after the 99 dollars"  
## [3] "100 euro are better than 99 euro"                                        
## [4] "Ok ok 1 2 3 4 5 and the last one is 6"                                   
## [5] "hetot hhats"                                                             
## [6] "hetlltstf"

For this we used the same statement as above but added the argument invert = TRUE. This inverts the selection: whereas task 7 gave us all strings containing a special character, we now get all strings that do not contain one.

9. Select all the strings that end with a punctuation mark.

grep("[[:punct:]]$", text_strings, value = TRUE, perl = TRUE)

## [1] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
## [2] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
## [3] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [4] "Who loves arrays more than me?"                                                                                                                       
## [5] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"

Here the dollar sign anchors the match to the end of the string. The POSIX character class [[:punct:]] matches any punctuation mark.

10. Select all the strings that start with a lowercase letter.

grep("^[a-z]", text_strings, value = TRUE)

## [1] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"
## [2] "hetot hhats"                                                             
## [3] "hetlltstf"

Instead of the $ sign for indicating the end of a string we can use the ^ sign to indicate the start of a string. In this case the ^ is followed by [a-z]. This means that the beginning of a string must be a letter that is in the range of lowercase a to lowercase z.

Substitution:

1. Add a punctuation mark to all the strings that end without one.

sub("([A-Za-z0-9])$", "\\1.", text_strings)

##  [1] "We have to extract these numbers 12, 47, 48."                                                                                                         
##  [2] "The integers numbers are also interestings: 189 2036 314."                                                                                            
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456."                                                            
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking."                                                             
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar."                                                                            
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing."                                                                      
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "Who loves arrays more than me?"                                                                                                                       
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside."                                                                                                   
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars."                                                                              
## [13] "100 euro are better than 99 euro."                                                                                                                    
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                              
## [15] "Ok ok 1 2 3 4 5 and the last one is 6."                                                                                                               
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando."                                                                       
## [17] "hetot hhats."                                                                                                                                         
## [18] "hetlltstf."

Here we first check whether the last character of a string is alphanumeric. This is done by [A-Za-z0-9]$. The A-Z covers all uppercase letters, the a-z covers all lowercase letters, and the 0-9 covers all digits. The dollar sign after the expression in square brackets indicates that it has to be at the end of the string. Furthermore, we put parentheses around it so we can treat it as a group, in this case group 1.

In the replacement argument of the function we wrote \\1.. This argument is made up of two elements. The first is the back-reference to the group (\\1). It makes sure that the last character of a string that fits the pattern is replaced by itself (i.e., it is kept). The additional . is the punctuation mark that is added. This mark could also be exchanged for any other punctuation mark; however, it might have to be escaped with backslashes.
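The role of the group can be seen by comparing the call with and without it. The sketch below uses a hypothetical vector, not the exercise data.

```r
# Hypothetical test strings (not part of the exercise data).
x <- c("no mark", "has mark.", "array [1,2]")

# With the group: the last character is kept and a full stop is appended.
with_group <- sub("([A-Za-z0-9])$", "\\1.", x)

# Without the group: the last character is replaced and therefore lost.
without_group <- sub("[A-Za-z0-9]$", ".", x)

# Any other mark can be appended in the same way.
other_mark <- sub("([A-Za-z0-9])$", "\\1!", x)

print(with_group)     # "no mark."  "has mark." "array [1,2]"
print(without_group)  # "no mar."   "has mark." "array [1,2]"
print(other_mark)     # "no mark!"  "has mark." "array [1,2]"
```

Strings that already end in a punctuation mark, including the closing bracket, do not match the pattern and are returned unchanged.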

2. Replace the word “dollars” with the symbol $.

gsub("dollars", "\\$", text_strings)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                              
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                             
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "Who loves arrays more than me?"                                                                                                                       
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                    
## [12] "Integers are fine but sometimes you like 99 cents after the 99 $"                                                                                     
## [13] "100 euro are better than 99 euro"                                                                                                                     
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                              
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"                                                                                                                
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [17] "hetot hhats"                                                                                                                                          
## [18] "hetlltstf"

Here the gsub() function is used in case the word dollars is used more than once in a string. Otherwise, this is a very simple replacement of the word dollars by the dollar sign.
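Where the $ has to be escaped depends on the argument. The sketch below, on a hypothetical vector, illustrates that in the replacement argument the escaped and unescaped forms give the same result, whereas in the pattern argument $ is the end-of-string anchor and must be escaped (or matched with fixed = TRUE) to be taken literally.

```r
# Hypothetical test strings (not part of the exercise data).
x <- c("99 dollars", "7890$ per month")

# Replacement side: "\\$" and "$" both insert a literal dollar sign.
escaped <- gsub("dollars", "\\$", x)
unescaped <- gsub("dollars", "$", x)

# Pattern side: the dollar sign must be escaped to match it literally ...
back_regex <- gsub("\\$", " dollars", x)
# ... or the pattern can be treated as a fixed string.
back_fixed <- gsub("$", " dollars", x, fixed = TRUE)

print(escaped)     # "99 $" "7890$ per month"
print(back_regex)  # "99 dollars" "7890 dollars per month"
```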

3. Replace all the euro words with the euro symbol €.

gsub("euro", "€", text_strings)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                              
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                             
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "Who loves arrays more than me?"                                                                                                                       
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                    
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                               
## [13] "100 € are better than 99 €"                                                                                                                           
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                              
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"                                                                                                                
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [17] "hetot hhats"                                                                                                                                          
## [18] "hetlltstf"

Here the same logic is applied as above. However, no escape signs (\\) had to be used, since the € sign is not a special character.

4. Replace all the strings with words that repeat themselves consecutively with words that have a single occurrence. For example, replace “Bye Bye” with “Bye”.

gsub("\\b(\\w+)\\b\\s\\b\\1\\b", "\\1", text_strings)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "We like to offer you 7890$ per month in order to complete this task... we are joking"                                                                 
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                             
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "Who loves arrays more than me?"                                                                                                                       
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                    
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                               
## [13] "100 euro are better than 99 euro"                                                                                                                     
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                              
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"                                                                                                                
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [17] "hetot hhats"                                                                                                                                          
## [18] "hetlltstf"

The main parts of the code are (\\w+) and \\1. The (\\w+) creates a group that contains a word of at least one character. The \\1 is a back-reference to this group, so the pattern only matches when the word is repeated. The \\b at the beginning and end make sure that only full words that occur twice are collapsed, and not words that share the same characters by coincidence (e.g., in “There is a hat at” the second “at” is not deleted, because it does not fully match “hat”). Lastly, the \\s in the middle requires a space between the two words, so words like “coco” are not affected. The \\1 in the replacement argument again refers back to the group that was created. The match is case sensitive, which is why “You you” in string 9 is left unchanged.
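The examples mentioned in the explanation can be checked directly. The sketch below uses a hypothetical vector; perl = TRUE is an added assumption so that \\b and the back-reference follow PCRE semantics, and the second call with ignore.case = TRUE is an optional extension that is not part of the original solution.

```r
# Hypothetical test strings (not part of the exercise data).
x <- c("Bye Bye", "There is a hat at", "coco", "We like to to offer", "You you are")

pattern <- "\\b(\\w+)\\b\\s\\b\\1\\b"

# Case-sensitive: "You you" is not treated as a repetition.
case_sensitive <- gsub(pattern, "\\1", x, perl = TRUE)

# Case-insensitive: the back-reference also matches in a different case.
case_insensitive <- gsub(pattern, "\\1", x, perl = TRUE, ignore.case = TRUE)

print(case_sensitive)
print(case_insensitive)
```

In the case-insensitive call the first occurrence is the one that is kept, so "You you are" becomes "You are".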

5. Replace all the digits that are separated between them by a space with a comma.

x <- gsub("(\\d)(\\s)(\\d)", "\\1\\, \\3", text_strings)
gsub("(\\d)(\\s)(\\d)", "\\1\\, \\3", x)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189, 2036, 314"                                                                                           
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                              
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                             
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "Who loves arrays more than me?"                                                                                                                       
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                    
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                               
## [13] "100 euro are better than 99 euro"                                                                                                                     
## [14] "I like to give you 1000 numbers now: 12, 3, 56, 21, 67, and more, [45,67,7]"                                                                          
## [15] "Ok ok 1, 2, 3, 4, 5 and the last one is 6"                                                                                                            
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [17] "hetot hhats"                                                                                                                                          
## [18] "hetlltstf"

The pattern used to find digits that are separated by a space is (\\d)(\\s)(\\d). This pattern consists of three groups. The first one (\\d) represents the first digit to match, the second one (\\s) represents the space between the digits, and the third one (\\d) represents the second digit. In the replacement argument the first and the third group are referenced, with the desired comma and space between them. The issue with this code arises when several digits are separated by spaces (e.g., 1 2 3 4). Because gsub() matches do not overlap, a digit consumed as the end of one match cannot start the next one. The code above would therefore find two matches (1 2 and 3 4) and only put a comma and a space between those (1, 2 3, 4). Running the same code again on the result gives the desired output.
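The second pass can be avoided with lookaround assertions, which check the neighbouring digits without consuming them. This is a sketch of an alternative and not the approach used above; it assumes perl = TRUE, and the test vector is hypothetical.

```r
# Hypothetical test strings (not part of the exercise data).
x <- c("Ok ok 1 2 3 4 5", "numbers 12 3 56", "3 things")

# Approach from the text: one pass leaves every second gap untouched ...
one_pass <- gsub("(\\d)(\\s)(\\d)", "\\1, \\3", x)
# ... and a second pass on the result fills the remaining gaps.
two_pass <- gsub("(\\d)(\\s)(\\d)", "\\1, \\3", one_pass)

# Lookaround variant: only the whitespace is matched and replaced,
# so neighbouring matches can share a digit.
lookaround <- gsub("(?<=\\d)\\s(?=\\d)", ", ", x, perl = TRUE)

print(one_pass)    # "Ok ok 1, 2 3, 4 5" "numbers 12, 3 56" "3 things"
print(lookaround)  # "Ok ok 1, 2, 3, 4, 5" "numbers 12, 3, 56" "3 things"
```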

6. Replace all the digits that are separated between them by a comma with a comma followed by a space.

gsub("(\\d),(\\d)", "\\1, \\2", text_strings)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125, 789, 1450 and also these 564, 90456"                                                          
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                              
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                             
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [9] "You you are a genius, I think that you like arrays A LOT, [3, 45, 67, 900, 1974]"                                                                     
## [10] "Who loves arrays more than me?"                                                                                                                       
## [11] "{366, 78, 90, 5}Yes, there are only 4 numbers inside"                                                                                                 
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                               
## [13] "100 euro are better than 99 euro"                                                                                                                     
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45, 67, 7]"                                                                            
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"                                                                                                                
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [17] "hetot hhats"                                                                                                                                          
## [18] "hetlltstf"

Once again, two groups are created so that we can replace them by themselves. Here the groups are the digit before the comma and the digit after it. They are separated by a literal comma, hence the expression (\\d),(\\d). In the replacement argument we call the groups individually and put a comma and a space between them.
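This pattern works on the dataset because the numbers between the commas mostly have several digits, or a single-digit number sits at the end of a list. A list of single digits such as 1,2,3 would run into the non-overlapping-match issue from exercise 5. The sketch below uses a made-up test string (not part of the dataset) to show this, together with a possible lookaround variant that matches only the comma itself.

```r
x <- c("[3,45,67,900,1974]", "1,2,3")   # second string is hypothetical

# Pattern from the exercise: fine for the first string, incomplete for the second
grouped <- gsub("(\\d),(\\d)", "\\1, \\2", x)

# Lookbehind/lookahead: only the comma is consumed, the digits are just checked
lookaround <- gsub("(?<=\\d),(?=\\d)", ", ", x, perl = TRUE)

print(grouped)     # "[3, 45, 67, 900, 1974]" "1, 2,3"
print(lookaround)  # "[3, 45, 67, 900, 1974]" "1, 2, 3"
```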

7. Replace all the words that start with a lowercase letter with words that start with an uppercase letter.

gsub("\\b([a-z])(\\w*)", "\\U\\1\\E\\2", text_strings, perl=TRUE)

##  [1] "We Have To Extract These Numbers 12, 47, 48"                                                                                                          
##  [2] "The Integers Numbers Are Also Interestings: 189 2036 314"                                                                                             
##  [3] "','Is A Separator, So Please Extract These Numbers 125,789,1450 And Also These 564,90456"                                                             
##  [4] "We Like To To Offer You 7890$ Per Month In Order To Complete This Task... We Are Joking"                                                              
##  [5] "You Are Going To Learn 3 Things, The First One Is Not To Extract, And 2 And 3 Are Simply Digits."                                                     
##  [6] "Have Fun With Our Mighty Test, You Are Going To Support Science, Progress, Mankind Wellness And You Are Going To Waste 30 Or 60 Minutes Of Your Life."
##  [7] "You Can Also Extract Exotic Stuff Like A456 Gb67 And 45678911ghth Dollar"                                                                             
##  [8] "Writing 1 Example Is Not Funny, Please Consider That 66% Is Validation+Testing"                                                                       
##  [9] "You You Are A Genius, I Think That You Like Arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "Who Loves Arrays More Than Me?"                                                                                                                       
## [11] "{366,78,90,5}Yes, There Are Only 4 Numbers Inside"                                                                                                    
## [12] "Integers Are Fine But Sometimes You Like 99 Cents After The 99 Dollars"                                                                               
## [13] "100 Euro Are Better Than 99 Euro"                                                                                                                     
## [14] "I Like To Give You 1000 Numbers Now: 12 3 56 21 67, And More, [45,67,7]"                                                                              
## [15] "Ok Ok 1 2 3 4 5 And The Last One Is 6"                                                                                                                
## [16] "33 Trentini Entrarono A Trento, Tutti E 33 Di Tratto In Tratto Trotterellando"                                                                        
## [17] "Hetot Hhats"                                                                                                                                          
## [18] "Hetlltstf"

The following pattern is used to find words that begin with a lowercase letter: \\b([a-z])(\\w*). The \\b at the start of the pattern marks a word boundary, here the start of a word. The range [a-z] then matches a lowercase first letter, and \\w* matches all the word characters that may follow it. The * (zero or more) ensures that words consisting of a single letter (e.g., “a”) are matched as well. The first letter and the remaining characters are put in two separate groups so that we can modify them and replace them by themselves. The only modification is to group 1, i.e., the first letter: we put the Perl-style case operators \\U and \\E around it so that it is converted to uppercase. These operators are only available with perl=TRUE. Group 2 is left unchanged.
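A small check of the behaviour described above, using strings taken from the dataset. The second call is a shorter variant offered as an assumption-free simplification of the same idea: since \\U applies to the rest of the replacement, and the replacement contains only group 1, \\E and the second group can be dropped.

```r
x <- c("hetot hhats", "a456 gb67 and 45678911ghth", "you like arrays A LOT")

# Pattern from the exercise: uppercase group 1, keep group 2 as is
full <- gsub("\\b([a-z])(\\w*)", "\\U\\1\\E\\2", x, perl = TRUE)

# Shorter variant: match only the first letter of each word
short <- gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)

print(full)
# "Hetot Hhats" "A456 Gb67 And 45678911ghth" "You Like Arrays A LOT"
```

Note that 45678911ghth is left alone: the word starts with a digit, and there is no word boundary between the last digit and the g.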

8. Add a dollar sign after all the strings that finish with a dot.

gsub("(\\.)$", "\\1\\$", text_strings, perl=TRUE)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                           
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                              
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                              
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                               
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits.$"                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life.$"
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                              
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                        
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                          
## [10] "Who loves arrays more than me?"                                                                                                                        
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                     
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                                
## [13] "100 euro are better than 99 euro"                                                                                                                      
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                               
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"                                                                                                                 
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                         
## [17] "hetot hhats"                                                                                                                                           
## [18] "hetlltstf"

Here we find the strings that finish with a dot by placing the $ anchor after the escaped dot \\.. We replace the match with the dot itself (group 1) followed by an escaped dollar sign.
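The sketch below checks the anchor on three shortened strings from the dataset: only a dot at the very end of a string is matched, so the "..." in the middle of string 4 is ignored. The second call is an equivalent spelling without a group. It relies on . and $ being ordinary characters in the replacement string, and on sub() being sufficient because an anchored pattern can match at most once per string.

```r
x <- c("simply digits.", "this task... we are joking", "more than me?")

# Pattern from the exercise
grouped <- gsub("(\\.)$", "\\1\\$", x, perl = TRUE)

# Equivalent without a group: write the dot and the dollar sign literally
plain <- sub("\\.$", ".$", x)

print(grouped)
# "simply digits.$" "this task... we are joking" "more than me?"
```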

9. Add a caret sign to all the strings that end with a dot, exactly before that dot.

gsub("(\\.)$", "\\^\\1", text_strings, perl=TRUE)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                           
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                              
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                              
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                               
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits^."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life^."
##  [7] "you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                              
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                        
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                          
## [10] "Who loves arrays more than me?"                                                                                                                        
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                     
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                                
## [13] "100 euro are better than 99 euro"                                                                                                                      
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                               
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"                                                                                                                 
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                         
## [17] "hetot hhats"                                                                                                                                           
## [18] "hetlltstf"

Here we use the same pattern as above. Because the dot is captured in a group, we can reposition it in the replacement pattern: we simply put the escaped ^ before the call of group 1, \\1.
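One edge case worth checking is a string that ends in several dots. The dataset contains no such string, so the second test string below is made up for illustration. Since the group captures only the final dot, the caret lands directly before that last dot.

```r
x <- c("simply digits.", "wait...")   # "wait..." is a hypothetical test string

res <- gsub("(\\.)$", "\\^\\1", x, perl = TRUE)

print(res)
# "simply digits^." "wait..^."
```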

10. Add a caret sign in front of all the strings that start with a lowercase letter.

gsub("^\\b([a-z])", "\\^\\1", text_strings, perl=TRUE)

##  [1] "We have to extract these numbers 12, 47, 48"                                                                                                          
##  [2] "The integers numbers are also interestings: 189 2036 314"                                                                                             
##  [3] "','is a separator, so please extract these numbers 125,789,1450 and also these 564,90456"                                                             
##  [4] "We like to to offer you 7890$ per month in order to complete this task... we are joking"                                                              
##  [5] "You are going to learn 3 things, the first one is not to extract, and 2 and 3 are simply digits."                                                     
##  [6] "Have fun with our mighty test, you are going to support science, progress, mankind wellness and you are going to waste 30 or 60 minutes of your life."
##  [7] "^you can also extract exotic stuff like a456 gb67 and 45678911ghth dollar"                                                                            
##  [8] "Writing 1 example is not funny, please consider that 66% is validation+testing"                                                                       
##  [9] "You you are a genius, I think that you like arrays A LOT, [3,45,67,900,1974]"                                                                         
## [10] "Who loves arrays more than me?"                                                                                                                       
## [11] "{366,78,90,5}Yes, there are only 4 numbers inside"                                                                                                    
## [12] "Integers are fine but sometimes you like 99 cents after the 99 dollars"                                                                               
## [13] "100 euro are better than 99 euro"                                                                                                                     
## [14] "I like to give you 1000 numbers now: 12 3 56 21 67, and more, [45,67,7]"                                                                              
## [15] "Ok ok 1 2 3 4 5 and the last one is 6"                                                                                                                
## [16] "33 trentini entrarono a Trento, tutti e 33 di tratto in tratto trotterellando"                                                                        
## [17] "^hetot hhats"                                                                                                                                         
## [18] "^hetlltstf"

Our search pattern looks like this: ^\\b([a-z]). The caret in the search pattern anchors the match to the start of the string. The additional \\b requires a word boundary at that position; it is not strictly necessary, because a letter at the very start of a string always begins a word. The grouped range from a to z ensures that only a lowercase first letter is matched. Grouping this first letter allows us to call it in the replacement, which consists of the escaped caret followed by the call of group 1.
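That the \\b is redundant can be verified directly. The sketch below compares the pattern from the exercise with a version without \\b on a few shortened strings from the dataset; the variable names are illustrative.

```r
x <- c("you can also", "You you are", "','is a separator", "33 trentini")

# Pattern from the exercise
with_b <- gsub("^\\b([a-z])", "\\^\\1", x, perl = TRUE)

# Same idea without \\b; the caret is an ordinary character in the replacement
without_b <- sub("^([a-z])", "^\\1", x)

print(with_b)
# "^you can also" "You you are" "','is a separator" "33 trentini"
```

Strings that start with an uppercase letter, a digit, or a quotation mark are left unchanged by both versions.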

Exercise 1

March 2024

Import the Dataset

Selection:

1. Select all the strings that have words that contain exactly two consecutive t’s.

2. Select all the strings that have words that contain exactly two t’s (not necessarily consecutive) or exactly two e’s.

3. Select all the strings that end in exactly one digit.

4. Select all the strings that end in two digits.

5. Select all the strings that have more than one capital letter.

6. Select all the strings that do not have the substring ‘nu’.

7. Select all the strings that contain special characters such as: [, €, and so on

8. Select all the strings that do not contain special characters.

9. Select all the strings that end with a punctuation mark.

10. Select all the strings that start with a lowercase letter.

Replacement:

1. Add a punctuation mark to all the strings that end without one.

2. Replace the word “dollars” with the symbol $.

3. Replace all the euro words with the euro symbol €.

4. Replace all the strings with words that repeat themselves consecutively with words that have a single occurrence. For example, replace “Bye Bye” with “Bye”.

5. Replace all the digits that are separated between them by a space with a comma.

6. Replace all the digits that are separated between them by a comma with a comma followed by a space.

7. Replace all the words that start with a lowercase letter with words that start with an uppercase letter.

8. Add a dollar sign after all the strings that finish with a dot.

9. Add a caret sign to all the strings that end with a dot, exactly before that dot.

10. Add a caret sign in front of all the strings that start with a lowercase letter.