✨ Introduction

This document contains the first exercise in Text Mining, focusing on:

- Regular expressions
- Text cleaning
- Preprocessing tasks using the stringr package


📘 Exercise 1: Pattern Matching

We start with a simple vector of strings and use str_view() to identify patterns.

# Load stringr, which provides str_view() and str_remove_all()
library(stringr)

# Vector of strings is given
vector <- c("emoticon", ":)", "symbol", "$^$")
writeLines(vector)
## emoticon
## :)
## symbol
## $^$

a) string of 3 characters with the letter o in the middle

str_view(vector, "^.o.$", match = TRUE)
# nothing
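
None of the four strings has exactly three characters, so the pattern finds nothing. As a quick sanity check of the pattern itself, the sketch below applies it to a few made-up words. The vector `words` is an assumption for illustration and is not part of the exercise data.

```r
library(stringr)

# Hypothetical test words, not part of the exercise data
words <- c("dog", "emoticon", "to", "owl")

# "^.o.$": any one character, then "o", then any one character, and nothing else
three_with_o <- str_detect(words, "^.o.$")
three_with_o
```

Only "dog" satisfies the pattern: "to" is too short, "emoticon" is too long, and "owl" has "w" in the middle.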

b) expression “emoticon”

str_view(vector, "emoticon", match = TRUE)
## [1] │ <emoticon>

c) expression “:)”

str_view(vector, ":\\)", match = TRUE)
## [2] │ <:)>

d) expression “$^$”

str_view(vector, "\\$\\^\\$", match = TRUE)
## [4] │ <$^$>
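
Escaping each metacharacter by hand is easy to get wrong. One alternative, sketched here, is stringr's fixed() modifier, which compares the pattern as a literal string so that ")", "$" and "^" need no backslashes. The names `strings`, `has_smiley` and `has_symbol` are chosen for illustration only.

```r
library(stringr)

# Same strings as in the exercise (named `strings` here for illustration)
strings <- c("emoticon", ":)", "symbol", "$^$")

# fixed() matches the pattern literally, so no escaping is required
has_smiley <- str_detect(strings, fixed(":)"))
has_symbol <- str_detect(strings, fixed("$^$"))

has_smiley
has_symbol
```

Each call flags exactly one element: the second string for the smiley and the fourth for the symbol sequence.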

🎯 Exercise 2: Detecting Patterns in a Corpus

Here we use str_view() to detect certain features in social media–style text data.

corpus <- c(
  "OMG I looove this movie!!! :D :) #cinema",
  "Visit https://data.org for more info!",
  "@user123 LOL that's crazy XD XD XD",
  "Email me at test_user@mail.com ASAP!!",
  "Working from HOME since 2020...",
  "BUY NOW!!! Only $9.99!",
  "Great job!!! Keep it up :) ;D",
  "So tired of this traffic jam... #monday",
  "New blog post https://myblog.net/post/123",
  "Check this out @friend — unbelievable!!!",
  "SALE starts TODAY!!! LIMITED TIME OFFER!",
  "Working hard or hardly working?",
  "Can't believe it's already 2025!!",
  "Follow us for updates @data_science_team",
  "Contact: info@company.com for details.",
  "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!",
  "LOL this made my day :P ;P",
  "Nothing better than cooooffee in the mooorning"
)

a) all posts containing a URL

(to identify links for removal during preprocessing)

str_view(corpus, "https?://[^ ]+", match = TRUE)
## [2] │ Visit <https://data.org> for more info!
## [9] │ New blog post <https://myblog.net/post/123>
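
str_view() is meant for visual inspection. To work with the matches programmatically, the same pattern can be passed to str_subset() and str_extract(), as in this minimal sketch. The three-element vector `posts` is an illustrative subset of the corpus, and the variable names are assumptions.

```r
library(stringr)

# Illustrative subset of the corpus
posts <- c(
  "Visit https://data.org for more info!",
  "Working hard or hardly working?",
  "New blog post https://myblog.net/post/123"
)

url_pattern <- "https?://[^ ]+"

# Keep only the posts that contain a URL
with_url <- str_subset(posts, url_pattern)

# Pull out the first URL of each post (NA where there is none)
urls <- str_extract(posts, url_pattern)

with_url
urls
```

str_extract() returns one value per input, which makes it convenient for adding a URL column to a data frame of posts.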

b) all posts containing user mentions (starting with @)

(to identify usernames for anonymization or network analysis)

str_view(corpus, "@[A-Za-z0-9_]+", match = TRUE)
##  [3] │ <@user123> LOL that's crazy XD XD XD
##  [4] │ Email me at test_user<@mail>.com ASAP!!
## [10] │ Check this out <@friend> — unbelievable!!!
## [14] │ Follow us for updates <@data_science_team>
## [15] │ Contact: info<@company>.com for details.
## [16] │ RT <@newsbot>: BREAKING NEWS: Market hits new record highs!!!
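
Posts 4 and 15 are matched only because the pattern also fires inside e-mail addresses. One possible refinement, offered as a sketch rather than the intended solution, is a negative lookbehind that rejects an "@" preceded by a character that can occur in the local part of an address. The vector `posts` and the pattern name `mention_pattern` are assumptions for illustration.

```r
library(stringr)

# Illustrative subset of the corpus
posts <- c(
  "@user123 LOL that's crazy XD XD XD",
  "Email me at test_user@mail.com ASAP!!",
  "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!"
)

# (?<![A-Za-z0-9._-]) : the "@" must not directly follow a letter, digit, ".", "_" or "-"
mention_pattern <- "(?<![A-Za-z0-9._-])@[A-Za-z0-9_]+"

is_mention <- str_detect(posts, mention_pattern)
mentions   <- str_extract(posts, mention_pattern)

is_mention
mentions
```

With this pattern the e-mail address is no longer reported as a mention, while "@user123" and "@newsbot" are still found.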

c) all posts that contain a run of 3 or more consecutive uppercase words (each at least two letters long)

(to identify potential shouting or emphasis)

str_view(corpus, "(\\b[A-Z]{2,}\\b[[:space:]]*){3,}", match = TRUE)
##  [3] │ @user123 LOL that's crazy <XD XD XD>
## [11] │ SALE starts TODAY!!! <LIMITED TIME OFFER>!
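
The pattern above only flags runs of consecutive uppercase words. A possible complement, sketched here on the assumption that a per-post count is useful, is str_count() with the single-word pattern. The vector `posts` and the name `n_upper` are illustrative.

```r
library(stringr)

# Illustrative subset of the corpus
posts <- c(
  "SALE starts TODAY!!! LIMITED TIME OFFER!",
  "@user123 LOL that's crazy XD XD XD",
  "Working hard or hardly working?"
)

# Count uppercase words of two or more letters in each post
n_upper <- str_count(posts, "\\b[A-Z]{2,}\\b")
n_upper
```

The first post contains five uppercase words (SALE, TODAY, LIMITED, TIME, OFFER), the second four, and the third none, so a threshold on this count would also catch shouting that is interrupted by lowercase words.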

d) all posts where any letter repeats 3 or more times

(to identify potential emotion, e.g. “sooooo happy”)

str_view(corpus, "([A-Za-z])\\1{2,}", match = TRUE)
##  [1] │ OMG I l<ooo>ve this movie!!! :D :) #cinema
## [18] │ Nothing better than c<oooo>ffee in the m<ooo>rning
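
The same backreference can be reused for normalization. The sketch below collapses any run of three or more identical letters to a single letter with str_replace_all(). This is a simplifying assumption that happens to suit these examples; it would damage a word whose correct spelling has a double letter at that position (for instance "coool" would become "col"). The vector `posts` is illustrative.

```r
library(stringr)

# Illustrative strings taken from the corpus, plus one already-normal word
posts <- c(
  "I looove this movie",
  "cooooffee in the mooorning",
  "coffee"
)

# "\\1" in the replacement inserts the captured letter exactly once
collapsed <- str_replace_all(posts, "([A-Za-z])\\1{2,}", "\\1")
collapsed
```

Double letters such as the "ff" and "ee" in "coffee" are left alone because the pattern requires at least three repetitions.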

🧹 Exercise 3: Cleaning Text Data

Now we apply str_remove_all() to clean the corpus.

corpus <- c(
  "OMG I looove this movie!!! :D :) #cinema",
  "Visit https://data.org for more info!",
  "@user123 LOL that's crazy XD XD XD",
  "Email me at test_user@mail.com ASAP!!",
  "Working from HOME since 2020...",
  "BUY NOW!!! Only $9.99!",
  "Great job!!! Keep it up :) ;D",
  "So tired of this traffic jam... #monday",
  "New blog post https://myblog.net/post/123",
  "Check this out @friend — unbelievable!!!",
  "SALE starts TODAY!!! LIMITED TIME OFFER!",
  "Working hard or hardly working?",
  "Can't believe it's already 2025!!",
  "Follow us for updates @data_science_team",
  "Contact: info@company.com for details.",
  "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!",
  "LOL this made my day :P ;P",
  "Nothing better than cooooffee in the mooorning"
)

a) Remove all words that contain a number

(years, prices, IDs that need cleaning)

corpus_a <- str_remove_all(corpus, "\\b\\w*\\d+\\w*\\b")
corpus_a
##  [1] "OMG I looove this movie!!! :D :) #cinema"                   
##  [2] "Visit https://data.org for more info!"                      
##  [3] "@ LOL that's crazy XD XD XD"                                
##  [4] "Email me at test_user@mail.com ASAP!!"                      
##  [5] "Working from HOME since ..."                                
##  [6] "BUY NOW!!! Only $.!"                                        
##  [7] "Great job!!! Keep it up :) ;D"                              
##  [8] "So tired of this traffic jam... #monday"                    
##  [9] "New blog post https://myblog.net/post/"                     
## [10] "Check this out @friend — unbelievable!!!"
## [11] "SALE starts TODAY!!! LIMITED TIME OFFER!"                   
## [12] "Working hard or hardly working?"                            
## [13] "Can't believe it's already !!"                              
## [14] "Follow us for updates @data_science_team"                   
## [15] "Contact: info@company.com for details."                     
## [16] "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!"
## [17] "LOL this made my day :P ;P"                                 
## [18] "Nothing better than cooooffee in the mooorning"
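
Because "\\b" treats "." and "@" as boundaries, the pattern above removes only the digit-bearing fragments, leaving residue such as "$." and "@". If the goal is instead to drop the whole whitespace-delimited token, one option is the pattern sketched below. It is an alternative under that assumption, not the exercise's solution, and it would also delete an entire URL that contains a digit. The vector `posts` is illustrative.

```r
library(stringr)

# Illustrative subset of the corpus
posts <- c(
  "BUY NOW!!! Only $9.99!",
  "@user123 LOL that's crazy",
  "Working from HOME since 2020..."
)

# \\S*\\d\\S* : a run of non-space characters that contains at least one digit
no_number_tokens <- str_remove_all(posts, "\\S*\\d\\S*")
no_number_tokens
```

Here "$9.99!", "@user123" and "2020..." disappear as complete tokens, leaving only the surrounding whitespace behind.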

b) Remove all words that are written entirely in uppercase letters

(removes shouting or acronyms; the {2,} quantifier keeps single capital letters such as “I”)

corpus_b <- str_remove_all(corpus, "\\b[A-Z]{2,}\\b")
corpus_b
##  [1] " I looove this movie!!! :D :) #cinema"         
##  [2] "Visit https://data.org for more info!"         
##  [3] "@user123  that's crazy   "                     
##  [4] "Email me at test_user@mail.com !!"             
##  [5] "Working from  since 2020..."                   
##  [6] " !!! Only $9.99!"                              
##  [7] "Great job!!! Keep it up :) ;D"                 
##  [8] "So tired of this traffic jam... #monday"       
##  [9] "New blog post https://myblog.net/post/123"     
## [10] "Check this out @friend — unbelievable!!!"
## [11] " starts !!!   !"                               
## [12] "Working hard or hardly working?"               
## [13] "Can't believe it's already 2025!!"             
## [14] "Follow us for updates @data_science_team"      
## [15] "Contact: info@company.com for details."        
## [16] " @newsbot:  : Market hits new record highs!!!" 
## [17] " this made my day :P ;P"                       
## [18] "Nothing better than cooooffee in the mooorning"

c) Remove all hashtags

corpus_c <- str_remove_all(corpus, "#\\w+")
corpus_c
##  [1] "OMG I looove this movie!!! :D :) "                          
##  [2] "Visit https://data.org for more info!"                      
##  [3] "@user123 LOL that's crazy XD XD XD"                         
##  [4] "Email me at test_user@mail.com ASAP!!"                      
##  [5] "Working from HOME since 2020..."                            
##  [6] "BUY NOW!!! Only $9.99!"                                     
##  [7] "Great job!!! Keep it up :) ;D"                              
##  [8] "So tired of this traffic jam... "                           
##  [9] "New blog post https://myblog.net/post/123"                  
## [10] "Check this out @friend — unbelievable!!!"
## [11] "SALE starts TODAY!!! LIMITED TIME OFFER!"                   
## [12] "Working hard or hardly working?"                            
## [13] "Can't believe it's already 2025!!"                          
## [14] "Follow us for updates @data_science_team"                   
## [15] "Contact: info@company.com for details."                     
## [16] "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!"
## [17] "LOL this made my day :P ;P"                                 
## [18] "Nothing better than cooooffee in the mooorning"
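
Removing the whole hashtag also discards the topic word. If that word should stay in the text, a possible variant is to strip only the "#" sign with a lookahead, as in this sketch. The vector `posts` and the name `tags_as_words` are assumptions for illustration.

```r
library(stringr)

# Illustrative subset of the corpus
posts <- c(
  "OMG I looove this movie!!! :D :) #cinema",
  "So tired of this traffic jam... #monday"
)

# "#(?=\\w)" matches a "#" only when a word character follows; the word itself is kept
tags_as_words <- str_remove_all(posts, "#(?=\\w)")
tags_as_words
```

Which variant is preferable depends on whether hashtags are treated as noise or as content in the later analysis.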

d) Remove all smiley-style emoticons like :), :D, :P, :-)

corpus_d <- str_remove_all(corpus, "[:;]-?[DPdp)(]+")
corpus_d
##  [1] "OMG I looove this movie!!!   #cinema"                       
##  [2] "Visit https://data.org for more info!"                      
##  [3] "@user123 LOL that's crazy XD XD XD"                         
##  [4] "Email me at test_user@mail.com ASAP!!"                      
##  [5] "Working from HOME since 2020..."                            
##  [6] "BUY NOW!!! Only $9.99!"                                     
##  [7] "Great job!!! Keep it up  "                                  
##  [8] "So tired of this traffic jam... #monday"                    
##  [9] "New blog post https://myblog.net/post/123"                  
## [10] "Check this out @friend — unbelievable!!!"
## [11] "SALE starts TODAY!!! LIMITED TIME OFFER!"                   
## [12] "Working hard or hardly working?"                            
## [13] "Can't believe it's already 2025!!"                          
## [14] "Follow us for updates @data_science_team"                   
## [15] "Contact: info@company.com for details."                     
## [16] "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!"
## [17] "LOL this made my day  "                                     
## [18] "Nothing better than cooooffee in the mooorning"
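
Before deleting anything, it can help to list what the pattern actually captures. The sketch below uses str_extract_all() with the same pattern; the vector `posts` (including the made-up third string) and the name `found` are illustrative.

```r
library(stringr)

# Two posts from the corpus plus one hypothetical string with a "nose" emoticon
posts <- c(
  "Great job!!! Keep it up :) ;D",
  "@user123 LOL that's crazy XD XD XD",
  "so happy :-)"
)

# Returns a list with one character vector of matches per post
found <- str_extract_all(posts, "[:;]-?[DPdp)(]+")
found
```

The second element is empty because the pattern requires ":" or ";" as eyes, so "XD" is not treated as an emoticon.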

Apply steps a) through d) in sequence to obtain corpus_CLEAN

corpus_CLEAN <- corpus |>
  str_remove_all("\\b\\w*\\d+\\w*\\b") |>     # (a)
  str_remove_all("\\b[A-Z]{2,}\\b")   |>     # (b)
  str_remove_all("#\\w+")              |>     # (c)
  str_remove_all("[:;]-?[DPdp)(]+")          # (d)

corpus_CLEAN
##  [1] " I looove this movie!!!   "                    
##  [2] "Visit https://data.org for more info!"         
##  [3] "@  that's crazy   "                            
##  [4] "Email me at test_user@mail.com !!"             
##  [5] "Working from  since ..."                       
##  [6] " !!! Only $.!"                                 
##  [7] "Great job!!! Keep it up  "                     
##  [8] "So tired of this traffic jam... "              
##  [9] "New blog post https://myblog.net/post/"        
## [10] "Check this out @friend — unbelievable!!!"
## [11] " starts !!!   !"                               
## [12] "Working hard or hardly working?"               
## [13] "Can't believe it's already !!"                 
## [14] "Follow us for updates @data_science_team"      
## [15] "Contact: info@company.com for details."        
## [16] " @newsbot:  : Market hits new record highs!!!" 
## [17] " this made my day  "                           
## [18] "Nothing better than cooooffee in the mooorning"
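
The removals leave leading, trailing and doubled spaces behind. A natural follow-up, sketched here as an assumption about the next preprocessing step, is to wrap the four removals in a helper and finish with str_squish(), which trims the ends and collapses internal whitespace. The function name `clean_post` and the vector `posts` are hypothetical.

```r
library(stringr)

# Hypothetical helper: applies steps (a) to (d), then normalizes whitespace
clean_post <- function(x) {
  x |>
    str_remove_all("\\b\\w*\\d+\\w*\\b") |>   # (a) words containing a digit
    str_remove_all("\\b[A-Z]{2,}\\b")    |>   # (b) all-uppercase words
    str_remove_all("#\\w+")              |>   # (c) hashtags
    str_remove_all("[:;]-?[DPdp)(]+")    |>   # (d) emoticons
    str_squish()                              # trim and collapse spaces
}

# Illustrative subset of the corpus
posts <- c(
  "OMG I looove this movie!!! :D :) #cinema",
  "SALE starts TODAY!!! LIMITED TIME OFFER!",
  "Working hard or hardly working?"
)

cleaned <- clean_post(posts)
cleaned
```

Leftover punctuation such as "!!! !" is still present after this step; removing it would be a separate decision.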