- Understand the task of supervised machine learning, and learn about feature representation.
- Learn about the way in which textual data are applied to machine learning algorithms.
- Introduce tidy data principles and see how to make data tidy with the functions from the magrittr and dplyr packages.
- See how the tidytext package applies tidy data principles to text via the unnest_tokens() function.
Here, tidy text is “a table with one-token-per-row”. A token is whatever unit of text is meaningful for our analysis: it could be a word, a word pair, a phrase, a sentence, etc.
This means that the process of getting text data tidy is largely a matter of:

1. Deciding what the level of the analysis is going to be - what the “token” is.
2. Splitting the text into tokens, a process called tokenization.
Let’s take a look at tokenization in more detail.
unnest_tokens()
The tidytext package provides a very useful tool for tokenization: unnest_tokens(). The unnest_tokens() function performs tokenization by splitting each text line into the required tokens and creating a new data frame with one token per row, that is, tidy text data.
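For example, here is a minimal sketch on a made-up two-line tibble (the lines object and its column names are purely illustrative):

library(tidyverse)
library(tidytext)

lines <- tibble(
  line = 1:2,
  text = c("Tidy text is a table", "with one token per row.")
)

lines %>%
  unnest_tokens(word, text) # token = "words" is the default; one lowercased word per row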
Suppose we want to analyze the individual words in tweets that include hashtags about COVID-19. We do this as follows:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
load("covid19_tweets_df.RData")
covid19_tweets_df %>%
unnest_tokens(word, text, token="words")
## # A tibble: 28,129,247 x 6
## user_id status_id created_at screen_name name word
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… fascina…
## 2 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… news
## 3 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… in
## 4 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… england
## 5 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… uk
## 6 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… firms
## 7 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… and
## 8 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… academi…
## 9 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… have
## 10 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… also
## # … with 28,129,237 more rows
Note the arguments passed to unnest_tokens():

- the original data frame containing the text, covid19_tweets_df (supplied via the pipe)
- the name of the variable we want to create for the tokens in the new tidy data frame (word)
- the name of the variable where the text is stored in the original data frame (text, i.e. the tweets are in covid19_tweets_df$text)
- the unit of tokenization (token = "words")
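The token argument accepts other units as well. As a sketch, here is bigram and sentence tokenization on another made-up one-row tibble (the object and output column names are illustrative):

library(tidyverse)
library(tidytext)

lines <- tibble(line = 1, text = "Tidy text is a table. A token is one unit of text.")

lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) # overlapping word pairs

lines %>%
  unnest_tokens(sentence, text, token = "sentences") # one sentence per row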
unnest_tweets()
A really nice feature of tidytext is that we can tokenize tweets. This means that Twitter-specific features are taken into account when splitting the strings in a column into one-token-per-row. For example, for analyzing tweets, we can explicitly keep meaningful symbols like @ and #, while URLs are also kept as single strings.
tweets <- tibble(
id = 1,
txt = "@rOpenSci and #rstats see: https://cran.r-project.org"
)
tweets
## # A tibble: 1 x 2
## id txt
## <dbl> <chr>
## 1 1 @rOpenSci and #rstats see: https://cran.r-project.org
tweets %>%
unnest_tokens(out, txt, token="words")
## # A tibble: 7 x 2
## id out
## <dbl> <chr>
## 1 1 ropensci
## 2 1 and
## 3 1 rstats
## 4 1 see
## 5 1 https
## 6 1 cran.r
## 7 1 project.org
# The unnest_tokens function does not preserve Twitter handles (@), hashtags (#), or URLs
tweets %>%
unnest_tweets(out, txt)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 5 x 2
## id out
## <dbl> <chr>
## 1 1 @ropensci
## 2 1 and
## 3 1 #rstats
## 4 1 see
## 5 1 https://cran.r-project.org
# Using the unnest_tweets function, such Twitter-specific features are taken into account.
We can apply the unnest_tweets function to our covid19_tweets_df, so we are now in a position to transform the full set of tweets into tidy text format. Before doing so, I want to keep only the created_at and text variables, remove any duplicated text, and focus on tweets posted only on March 27, 2020.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
covid19_tweets_df %>%
select(created_at, text) %>%
filter(!duplicated(text)) %>%
mutate(date = floor_date(created_at, unit="day")) %>%
filter(date == as.Date("2020-03-27")) %>%
unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 12,774,915 x 3
## created_at date word
## <dttm> <dttm> <chr>
## 1 2020-03-27 04:28:33 2020-03-27 00:00:00 fascinating
## 2 2020-03-27 04:28:33 2020-03-27 00:00:00 news
## 3 2020-03-27 04:28:33 2020-03-27 00:00:00 in
## 4 2020-03-27 04:28:33 2020-03-27 00:00:00 england
## 5 2020-03-27 04:28:33 2020-03-27 00:00:00 uk
## 6 2020-03-27 04:28:33 2020-03-27 00:00:00 firms
## 7 2020-03-27 04:28:33 2020-03-27 00:00:00 and
## 8 2020-03-27 04:28:33 2020-03-27 00:00:00 academics
## 9 2020-03-27 04:28:33 2020-03-27 00:00:00 have
## 10 2020-03-27 04:28:33 2020-03-27 00:00:00 also
## # … with 12,774,905 more rows
# The as.Date function converts a character representation of a date into an object of class "Date" representing calendar dates.
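A quick sketch of what the two date functions do here (the timestamp is made up):

library(lubridate)

x <- ymd_hms("2020-03-27 04:28:33") # parsed as a date-time in UTC
floor_date(x, unit = "day") # 2020-03-27 00:00:00 UTC: the time of day is truncated to midnight
as.Date("2020-03-27") # an object of class "Date" for the same calendar day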
covid19_tweets_tidy <- covid19_tweets_df %>%
select(created_at, text) %>%
filter(!duplicated(text)) %>%
mutate(date = floor_date(created_at, unit="day")) %>%
filter(date == as.Date("2020-03-27")) %>%
unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid19_tweets_tidy
## # A tibble: 12,774,915 x 3
## created_at date word
## <dttm> <dttm> <chr>
## 1 2020-03-27 04:28:33 2020-03-27 00:00:00 fascinating
## 2 2020-03-27 04:28:33 2020-03-27 00:00:00 news
## 3 2020-03-27 04:28:33 2020-03-27 00:00:00 in
## 4 2020-03-27 04:28:33 2020-03-27 00:00:00 england
## 5 2020-03-27 04:28:33 2020-03-27 00:00:00 uk
## 6 2020-03-27 04:28:33 2020-03-27 00:00:00 firms
## 7 2020-03-27 04:28:33 2020-03-27 00:00:00 and
## 8 2020-03-27 04:28:33 2020-03-27 00:00:00 academics
## 9 2020-03-27 04:28:33 2020-03-27 00:00:00 have
## 10 2020-03-27 04:28:33 2020-03-27 00:00:00 also
## # … with 12,774,905 more rows
And count the word frequencies:
covid19_tweets_tidy %>%
count(word, sort = TRUE)
## # A tibble: 814,531 x 2
## word n
## <chr> <int>
## 1 the 452109
## 2 to 360861
## 3 of 232595
## 4 and 219941
## 5 covid19 195903
## 6 in 187403
## 7 a 177265
## 8 #covid19 169430
## 9 is 151050
## 10 for 150215
## # … with 814,521 more rows
There are still three problems to be resolved: stop words, numbers, and leftover HTML entities and other non-ASCII characters.

1. Stop words: extremely common words (such as "the", "to", and "of") that carry little meaning for the analysis. We can remove them with the help of the stopwords package.

library(stopwords)
stopwords() # returns the vector of stop words
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "will"
The matching operator %in% returns a logical vector indicating, for each element of the vector A, whether it is matched by any element of the vector B: A %in% B
c("abc","bcd","123") %in% c("abc","efg","hlm")
## [1] TRUE FALSE FALSE
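Negating the result gives the stop-word filter used below (the token vector is made up):

library(stopwords)

tokens <- c("the", "war", "against", "covid19")
tokens[!tokens %in% stopwords()] # keeps "war" and "covid19"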
2. Numbers: We can remove any token that does not contain an alphabetic character. For example:
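A sketch using str_detect() from stringr (the token vector is made up):

library(stringr)

tokens <- c("2020", "covid19", "100,000", "19")
tokens[str_detect(tokens, "[a-z]")] # keeps only "covid19"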
3. HTML (Hypertext Markup Language) entities and non-ASCII characters: the raw tweets contain some HTML entity references (&amp; = &, &lt; = <, &gt; = >, &quot; = ") that can be matched by the following regex: "&amp;|&lt;|&gt;|&quot;"
covid19_tweets_df %>%
select(created_at, text) %>%
slice(12) %>%
unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 16 x 2
## created_at word
## <dttm> <chr>
## 1 2020-03-27 04:19:13 the
## 2 2020-03-27 04:19:13 war
## 3 2020-03-27 04:19:13 against
## 4 2020-03-27 04:19:13 #covid19
## 5 2020-03-27 04:19:13 has
## 6 2020-03-27 04:19:13 to
## 7 2020-03-27 04:19:13 be
## 8 2020-03-27 04:19:13 won
## 9 2020-03-27 04:19:13 at
## 10 2020-03-27 04:19:13 home
## 11 2020-03-27 04:19:13 #stayhome
## 12 2020-03-27 04:19:13 amp
## 13 2020-03-27 04:19:13 stay
## 14 2020-03-27 04:19:13 safe
## 15 2020-03-27 04:19:13 #coronavirusoutbreak
## 16 2020-03-27 04:19:13 https://t.co/pbkonlwqey
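Note the stray amp token above: it is what remains of &amp; after tokenization. A minimal sketch of stripping such entities before tokenizing (the example string is made up):

library(stringr)

txt <- "Stay home &amp; stay safe &gt; everything"
str_replace_all(txt, "&amp;|&lt;|&gt;|&quot;", " ") # entity references replaced by spaces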
"[^[:ascii]]+"
covid19_tweets_df %>%
select(created_at, text) %>%
slice(2) %>%
unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 6 x 2
## created_at word
## <dttm> <chr>
## 1 2020-03-27 04:28:33 https://t.co/6zhx6m6rwx
## 2 2020-03-27 04:28:33 corona
## 3 2020-03-27 04:28:33 virus
## 4 2020-03-27 04:28:33 rhapsody
## 5 2020-03-27 04:28:33 🎧🎸🎼🎤
## 6 2020-03-27 04:28:33 #covid19
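A similar sketch for the non-ASCII regex (again with a made-up string):

library(stringr)

txt <- "corona virus rhapsody \U0001F3A7\U0001F3B8 #covid19"
str_replace_all(txt, "[^[:ascii:]]+", " ") # the emoji run collapses to a single space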
covid19_tweets_tidy <- covid19_tweets_df %>%
select(created_at, text) %>%
filter(!duplicated(text)) %>%
mutate(date = floor_date(created_at, unit="day")) %>%
filter(date == as.Date("2020-03-27")) %>%
mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>%
mutate(text = str_replace_all(text, "&|<|>|"|RT", " ")) %>%
unnest_tweets(word, text) %>%
filter(!word %in% stopwords()) %>%
filter(str_detect(word, "[a-z]"))
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid19_tweets_tidy %>% count(word, sort=T)
## # A tibble: 768,709 x 2
## word n
## <chr> <int>
## 1 covid19 196574
## 2 #covid19 170114
## 3 #coronavirus 107914
## 4 people 44139
## 5 s 42049
## 6 can 40212
## 7 us 39279
## 8 cases 37156
## 9 now 37062
## 10 coronavirus 28365
## # … with 768,699 more rows
wordcloud
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(428) # set.seed fixes the random seed so repeated runs place the words in the same positions
covid19_tweets_tidy %>%
filter(str_detect(word, "^a[[:word:]]+")) %>%
count(word, sort=TRUE) %>%
with(wordcloud(words = word, # The with() function evaluates an expression within the dataset.
freq = n,
max.words = 200, # Maximum numbers of words plotted
random.order = FALSE, # Highly frequent words placed in the middle
rot.per = 0.2, # Rate of words rotated in plot
scale = c(3, 0.3), # Range of words in size
colors = brewer.pal(8, "Dark2"))) # Retrieve 8 colors from the list of "Dark2"