This document contains the first exercise in Text Mining, focusing on:
- Regular expressions
- Text cleaning
- Preprocessing tasks using the stringr package
We start with a simple vector of strings and use str_view() to identify patterns.
## emoticon
## :)
## symbol
## $^$
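The two strings printed above consist largely of regex metacharacters, so matching them literally requires escaping. The sketch below assumes a two-element vector `x` as a stand-in for the original input (the exact definition is not shown in this document) and uses str_detect() so the results are easy to check; str_view() accepts the same patterns.

```r
library(stringr)

# Hypothetical stand-in for the vector used in this first step
x <- c(":)", "$^$")

# ")", "$" and "^" are regex metacharacters, so each is escaped with "\\"
emoticon_hits <- str_detect(x, ":\\)")       # TRUE FALSE
symbol_hits   <- str_detect(x, "\\$\\^\\$")  # FALSE TRUE

# fixed() treats the pattern as a literal string, so no escaping is needed
symbol_hits_fixed <- str_detect(x, fixed("$^$"))  # FALSE TRUE
```

Using fixed() is often the simpler choice when the whole pattern is meant literally; escaping is needed once literal characters are mixed with regex syntax.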
Here we use str_view() to detect certain features in social media-style text data.
corpus <- c(
"OMG I looove this movie!!! :D :) #cinema",
"Visit https://data.org for more info!",
"@user123 LOL that's crazy XD XD XD",
"Email me at test_user@mail.com ASAP!!",
"Working from HOME since 2020...",
"BUY NOW!!! Only $9.99!",
"Great job!!! Keep it up :) ;D",
"So tired of this traffic jam... #monday",
"New blog post https://myblog.net/post/123",
"Check this out @friend – unbelievable!!!",
"SALE starts TODAY!!! LIMITED TIME OFFER!",
"Working hard or hardly working?",
"Can't believe it's already 2025!!",
"Follow us for updates @data_science_team",
"Contact: info@company.com for details.",
"RT @newsbot: BREAKING NEWS: Market hits new record highs!!!",
"LOL this made my day :P ;P",
"Nothing better than cooooffee in the mooorning"
)
(to identify links for removal during preprocessing)
## [2] │ Visit <https://data.org> for more info!
## [9] │ New blog post <https://myblog.net/post/123>
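The pattern behind this output is not shown, so the following is only a sketch of one pattern that reproduces the highlighted matches. The name `url_pattern` and the three-element `texts` vector are assumptions made for illustration.

```r
library(stringr)

# Small excerpt of the corpus, used here for illustration only
texts <- c(
  "Visit https://data.org for more info!",
  "New blog post https://myblog.net/post/123",
  "Working hard or hardly working?"
)

# Assumed pattern: "http" or "https", then "://", then everything up to
# the next whitespace character
url_pattern <- "https?://\\S+"

has_url <- str_detect(texts, url_pattern)   # TRUE TRUE FALSE
urls    <- str_extract(texts, url_pattern)
# "https://data.org" "https://myblog.net/post/123" NA
```

Calling str_view(texts, url_pattern) would highlight the same spans. Because `\\S+` stops only at whitespace, a URL followed directly by punctuation would include that punctuation in the match.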
(to identify usernames for anonymization or network analysis)
## [3] │ <@user123> LOL that's crazy XD XD XD
## [4] │ Email me at test_user<@mail>.com ASAP!!
## [10] │ Check this out <@friend> – unbelievable!!!
## [14] │ Follow us for updates <@data_science_team>
## [15] │ Contact: info<@company>.com for details.
## [16] │ RT <@newsbot>: BREAKING NEWS: Market hits new record highs!!!
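The output above also highlights `@mail` and `@company`, which belong to e-mail addresses rather than usernames. A minimal sketch, assuming the simple pattern `@\\w+` was used (the original call is not shown), together with a stricter variant that is purely a suggestion:

```r
library(stringr)

# Small excerpt of the corpus, used here for illustration only
texts <- c(
  "@user123 LOL that's crazy XD XD XD",
  "Email me at test_user@mail.com ASAP!!",
  "Follow us for updates @data_science_team"
)

# Assumed pattern: "@" followed by one or more word characters
user_pattern <- "@\\w+"
loose <- str_extract(texts, user_pattern)
# "@user123" "@mail" "@data_science_team"

# Suggested variant: the negative lookbehind rejects an "@" that directly
# follows a word character, which excludes e-mail addresses
strict_pattern <- "(?<!\\w)@\\w+"
strict <- str_extract(texts, strict_pattern)
# "@user123" NA "@data_science_team"
```

Whether the stricter variant is preferable depends on the goal: for anonymization it may be useful to catch e-mail addresses as well, with a separate pattern.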
(to identify potential shouting or emphasis)
## [3] │ @user123 LOL that's crazy <XD XD XD>
## [11] │ SALE starts TODAY!!! <LIMITED TIME OFFER>!
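Only runs of three upper-case words are highlighted above, while two-word runs such as "BUY NOW" and "BREAKING NEWS" are not. The pattern below is an assumption that reproduces this behaviour; the original pattern is not shown and may differ.

```r
library(stringr)

# Small excerpt of the corpus, used here for illustration only
texts <- c(
  "@user123 LOL that's crazy XD XD XD",
  "SALE starts TODAY!!! LIMITED TIME OFFER!",
  "BUY NOW!!! Only $9.99!"
)

# Assumed pattern: an upper-case word of 2+ letters, followed by at least
# two more such words, each separated by whitespace
shout_pattern <- "\\b[A-Z]{2,}(?:\\s+[A-Z]{2,}){2,}\\b"

shouting <- str_extract(texts, shout_pattern)
# "XD XD XD" "LIMITED TIME OFFER" NA
```

Lowering the repetition count from `{2,}` to `{1,}` would also flag two-word runs such as "BUY NOW".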
Now we apply str_remove_all() to clean the corpus.
corpus <- c(
"OMG I looove this movie!!! :D :) #cinema",
"Visit https://data.org for more info!",
"@user123 LOL that's crazy XD XD XD",
"Email me at test_user@mail.com ASAP!!",
"Working from HOME since 2020...",
"BUY NOW!!! Only $9.99!",
"Great job!!! Keep it up :) ;D",
"So tired of this traffic jam... #monday",
"New blog post https://myblog.net/post/123",
"Check this out @friend – unbelievable!!!",
"SALE starts TODAY!!! LIMITED TIME OFFER!",
"Working hard or hardly working?",
"Can't believe it's already 2025!!",
"Follow us for updates @data_science_team",
"Contact: info@company.com for details.",
"RT @newsbot: BREAKING NEWS: Market hits new record highs!!!",
"LOL this made my day :P ;P",
"Nothing better than cooooffee in the mooorning"
)
(years, prices, IDs that need cleaning)
## [1] "OMG I looove this movie!!! :D :) #cinema"
## [2] "Visit https://data.org for more info!"
## [3] "@ LOL that's crazy XD XD XD"
## [4] "Email me at test_user@mail.com ASAP!!"
## [5] "Working from HOME since ..."
## [6] "BUY NOW!!! Only $.!"
## [7] "Great job!!! Keep it up :) ;D"
## [8] "So tired of this traffic jam... #monday"
## [9] "New blog post https://myblog.net/post/"
## [10] "Check this out @friend – unbelievable!!!"
## [11] "SALE starts TODAY!!! LIMITED TIME OFFER!"
## [12] "Working hard or hardly working?"
## [13] "Can't believe it's already !!"
## [14] "Follow us for updates @data_science_team"
## [15] "Contact: info@company.com for details."
## [16] "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!"
## [17] "LOL this made my day :P ;P"
## [18] "Nothing better than cooooffee in the mooorning"
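This output is consistent with pattern (a) from the cleaning pipeline in this document, `\\b\\w*\\d+\\w*\\b`, which removes every word-like token that contains at least one digit. The sketch below applies it to a few corpus strings and contrasts it with a digits-only pattern, which is an alternative added here for comparison and is not part of the original exercise.

```r
library(stringr)

# Small excerpt of the corpus, used here for illustration only
texts <- c(
  "@user123 LOL that's crazy XD XD XD",
  "Working from HOME since 2020...",
  "BUY NOW!!! Only $9.99!"
)

# Pattern (a): any run of word characters that contains a digit
no_digit_tokens <- str_remove_all(texts, "\\b\\w*\\d+\\w*\\b")
# "@ LOL that's crazy XD XD XD"
# "Working from HOME since ..."
# "BUY NOW!!! Only $.!"

# Alternative for comparison: only tokens made up entirely of digits,
# so "user123" is left untouched
no_pure_numbers <- str_remove_all(texts, "\\b\\d+\\b")
# "@user123 LOL that's crazy XD XD XD"
# "Working from HOME since ..."
# "BUY NOW!!! Only $.!"
```

Both variants leave the "$" and "." of the price behind, because `\\w` and `\\d` do not match punctuation.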
(removes shouting or acronyms)
## [1] " I looove this movie!!! :D :) #cinema"
## [2] "Visit https://data.org for more info!"
## [3] "@user123 that's crazy "
## [4] "Email me at test_user@mail.com !!"
## [5] "Working from since 2020..."
## [6] " !!! Only $9.99!"
## [7] "Great job!!! Keep it up :) ;D"
## [8] "So tired of this traffic jam... #monday"
## [9] "New blog post https://myblog.net/post/123"
## [10] "Check this out @friend – unbelievable!!!"
## [11] " starts !!! !"
## [12] "Working hard or hardly working?"
## [13] "Can't believe it's already 2025!!"
## [14] "Follow us for updates @data_science_team"
## [15] "Contact: info@company.com for details."
## [16] " @newsbot: : Market hits new record highs!!!"
## [17] " this made my day :P ;P"
## [18] "Nothing better than cooooffee in the mooorning"
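The output above matches pattern (b) from the cleaning pipeline in this document, `\\b[A-Z]{2,}\\b`. A short sketch on two corpus strings shows that single capital letters such as "I" survive, and that the removed words leave their surrounding spaces behind.

```r
library(stringr)

# Small excerpt of the corpus, used here for illustration only
texts <- c(
  "OMG I looove this movie!!! :D :) #cinema",
  "Working from HOME since 2020..."
)

# Pattern (b): whole words made of two or more upper-case ASCII letters
no_caps <- str_remove_all(texts, "\\b[A-Z]{2,}\\b")
# " I looove this movie!!! :D :) #cinema"
# "Working from  since 2020..."   (note the doubled space)
```

The `{2,}` quantifier is what protects "I" and the "D" in ":D"; the pattern also removes the text emoticon "XD", which is made of two capital letters.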
(removes emoticons)
## [1] "OMG I looove this movie!!! #cinema"
## [2] "Visit https://data.org for more info!"
## [3] "@user123 LOL that's crazy XD XD XD"
## [4] "Email me at test_user@mail.com ASAP!!"
## [5] "Working from HOME since 2020..."
## [6] "BUY NOW!!! Only $9.99!"
## [7] "Great job!!! Keep it up "
## [8] "So tired of this traffic jam... #monday"
## [9] "New blog post https://myblog.net/post/123"
## [10] "Check this out @friend – unbelievable!!!"
## [11] "SALE starts TODAY!!! LIMITED TIME OFFER!"
## [12] "Working hard or hardly working?"
## [13] "Can't believe it's already 2025!!"
## [14] "Follow us for updates @data_science_team"
## [15] "Contact: info@company.com for details."
## [16] "RT @newsbot: BREAKING NEWS: Market hits new record highs!!!"
## [17] "LOL this made my day "
## [18] "Nothing better than cooooffee in the mooorning"
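In this output the emoticons are gone while hashtags are kept, which matches pattern (d) from the cleaning pipeline in this document, `[:;]-?[DPdp)(]+`. The sketch below applies it to two corpus strings; the last string, "Note:Please reply", is a hypothetical example added to show a possible false positive.

```r
library(stringr)

# Pattern (d): ":" or ";", an optional "-", then one or more of D, P, d, p, ), (
emoticon_pattern <- "[:;]-?[DPdp)(]+"

texts <- c(
  "Great job!!! Keep it up :) ;D",
  "LOL this made my day :P ;P"
)
no_emoticons <- str_remove_all(texts, emoticon_pattern)
# "Great job!!! Keep it up  "
# "LOL this made my day  "

# Hypothetical string: a colon directly followed by "P" is also removed
false_positive <- str_remove_all("Note:Please reply", emoticon_pattern)
# "Notelease reply"
```

Requiring whitespace or the start of the string before the emoticon, for example with a lookbehind, is one way to reduce such false positives; that refinement is a suggestion and not part of the original exercise.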
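corpus_CLEAN <- corpus |>
  str_remove_all("\\b\\w*\\d+\\w*\\b") |> # (a) tokens containing digits (years, prices, IDs)
  str_remove_all("\\b[A-Z]{2,}\\b") |>    # (b) all-caps words of two or more letters
  str_remove_all("#\\w+") |>              # (c) hashtags
  str_remove_all("[:;]-?[DPdp)(]+")       # (d) emoticons such as :) ;D :P
corpus_CLEAN
## [1] " I looove this movie!!! "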
## [2] "Visit https://data.org for more info!"
## [3] "@ that's crazy "
## [4] "Email me at test_user@mail.com !!"
## [5] "Working from since ..."
## [6] " !!! Only $.!"
## [7] "Great job!!! Keep it up "
## [8] "So tired of this traffic jam... "
## [9] "New blog post https://myblog.net/post/"
## [10] "Check this out @friend – unbelievable!!!"
## [11] " starts !!! !"
## [12] "Working hard or hardly working?"
## [13] "Can't believe it's already !!"
## [14] "Follow us for updates @data_science_team"
## [15] "Contact: info@company.com for details."
## [16] " @newsbot: : Market hits new record highs!!!"
## [17] " this made my day "
## [18] "Nothing better than cooooffee in the mooorning"
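The cleaned strings still carry leading, trailing, and doubled spaces where tokens were removed. One possible follow-up step, not part of the original exercise, is str_squish() from stringr, which trims both ends and collapses internal runs of whitespace. The sketch reruns steps (a) to (d) on two corpus strings; the names `texts`, `cleaned`, and `squished` are chosen here for illustration.

```r
library(stringr)

# Two strings from the corpus, used here for illustration only
texts <- c(
  "OMG I looove this movie!!! :D :) #cinema",
  "SALE starts TODAY!!! LIMITED TIME OFFER!"
)

# Same four removal steps as in the pipeline above
cleaned <- texts |>
  str_remove_all("\\b\\w*\\d+\\w*\\b") |>
  str_remove_all("\\b[A-Z]{2,}\\b") |>
  str_remove_all("#\\w+") |>
  str_remove_all("[:;]-?[DPdp)(]+")

# Optional extra step: trim and collapse the leftover whitespace
squished <- str_squish(cleaned)
# "I looove this movie!!!" "starts !!! !"
```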