Learning Objectives

  1. Understand the task of supervised machine learning, and learn about feature representation

  2. Learn how textual data are prepared as input for machine learning algorithms

  3. Introduce tidy data principles and see how to make data tidy using functions from the magrittr and dplyr packages.

  4. See how the tidytext package applies tidy data principles to text via the unnest_tokens() function.

Tidy text data

Here, tidy text is “a table with one-token-per-row”. A token is whatever unit of text is meaningful for our analysis: it could be a word, a word pair, a phrase, a sentence, etc.

This means that the process of getting text data tidy is largely a matter of

  1. Deciding what the level of the analysis is going to be - what the “token” is.

  2. Splitting the text into tokens, a process called tokenization.

Let’s take a look at tokenization in more detail.
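
For instance, here is a minimal sketch of tokenizing the same text at three different levels (a toy one-row tibble, not the COVID-19 data; the functions are the same ones used below):

library(tidyverse)
library(tidytext)

doc <- tibble(id = 1,
              text = "Tidy text is simple. Tokens can be words or sentences.")

doc %>% unnest_tokens(word, text, token = "words")           # one row per word
doc %>% unnest_tokens(sentence, text, token = "sentences")   # one row per sentence
doc %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) # one row per word pair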

Tokenization with unnest_tokens()

The tidytext package provides a very useful tool for tokenization: unnest_tokens(). The unnest_tokens() function performs tokenization by splitting each line of text into the required tokens and creating a new data frame with one token per row, that is, tidy text data.

Suppose we want to analyze the individual words in tweets containing hashtags about COVID-19. We can do this as follows:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
load("covid19_tweets_df.RData")
covid19_tweets_df %>% 
  unnest_tokens(word, text, token="words")
## # A tibble: 28,129,247 x 6
##    user_id   status_id       created_at          screen_name  name      word    
##    <chr>     <chr>           <dttm>              <chr>        <chr>     <chr>   
##  1 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… fascina…
##  2 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… news    
##  3 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… in      
##  4 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… england 
##  5 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… uk      
##  6 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… firms   
##  7 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… and     
##  8 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… academi…
##  9 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… have    
## 10 408707568 12433944544188… 2020-03-27 04:28:33 KathleenBur… Kathleen… also    
## # … with 28,129,237 more rows

Note the arguments passed to unnest_tokens():

  * the original data frame containing the text, covid19_tweets_df (supplied here via the pipe)
  * the variable name we want to create for the tokens in the new tidy data frame (word)
  * the variable name where the text is stored in the original data frame (text, i.e. the tweets are in covid19_tweets_df$text)
  * the unit of tokenization (token="words")
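
The same call can also be written with named arguments; this is just the call above made explicit, not a new computation:

covid19_tweets_df %>% 
  unnest_tokens(output = word, input = text, token = "words")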

Special feature of unnest_tweets()

A really nice feature of tidytext is that we can tokenize tweets. This means Twitter-specific features are taken into account when splitting the strings in a column into one token per row. For example, when analyzing tweets we can explicitly keep symbols like @ and # that carry meaning, while URLs are also kept as single tokens.

tweets <- tibble(
   id = 1,
   txt = "@rOpenSci and #rstats see: https://cran.r-project.org"
)
tweets
## # A tibble: 1 x 2
##      id txt                                                  
##   <dbl> <chr>                                                
## 1     1 @rOpenSci and #rstats see: https://cran.r-project.org
tweets %>% 
  unnest_tokens(out, txt, token="words")
## # A tibble: 7 x 2
##      id out        
##   <dbl> <chr>      
## 1     1 ropensci   
## 2     1 and        
## 3     1 rstats     
## 4     1 see        
## 5     1 https      
## 6     1 cran.r     
## 7     1 project.org
# unnest_tokens() does not preserve Twitter handles (@), hashtags (#), or URLs

tweets %>%
   unnest_tweets(out, txt)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 5 x 2
##      id out                       
##   <dbl> <chr>                     
## 1     1 @ropensci                 
## 2     1 and                       
## 3     1 #rstats                   
## 4     1 see                       
## 5     1 https://cran.r-project.org
# With unnest_tweets(), these Twitter-specific features are preserved.

We can apply the unnest_tweets() function to covid19_tweets_df, so we are now in a position to transform the full set of tweets into tidy text format. Before doing so, I want to keep only the created_at and text variables, remove any duplicated text, and focus on tweets posted on March 27, 2020.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
covid19_tweets_df %>% 
  select(created_at, text) %>% 
  filter(!duplicated(text)) %>% 
  mutate(date = floor_date(created_at, unit="day")) %>% 
  filter(date == as.Date("2020-03-27")) %>% 
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 12,774,915 x 3
##    created_at          date                word       
##    <dttm>              <dttm>              <chr>      
##  1 2020-03-27 04:28:33 2020-03-27 00:00:00 fascinating
##  2 2020-03-27 04:28:33 2020-03-27 00:00:00 news       
##  3 2020-03-27 04:28:33 2020-03-27 00:00:00 in         
##  4 2020-03-27 04:28:33 2020-03-27 00:00:00 england    
##  5 2020-03-27 04:28:33 2020-03-27 00:00:00 uk         
##  6 2020-03-27 04:28:33 2020-03-27 00:00:00 firms      
##  7 2020-03-27 04:28:33 2020-03-27 00:00:00 and        
##  8 2020-03-27 04:28:33 2020-03-27 00:00:00 academics  
##  9 2020-03-27 04:28:33 2020-03-27 00:00:00 have       
## 10 2020-03-27 04:28:33 2020-03-27 00:00:00 also       
## # … with 12,774,905 more rows
# The as.Date() function converts a character representation of a date into an object of class "Date", which represents calendar dates.
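
To see the two date helpers in isolation, here is a minimal sketch with a made-up timestamp:

x <- as.POSIXct("2020-03-27 04:28:33", tz = "UTC")
floor_date(x, unit = "day") # rounds down to midnight: 2020-03-27 00:00:00 UTC
as.Date("2020-03-27")       # parses the string into a Date object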

covid19_tweets_tidy <- covid19_tweets_df %>% 
  select(created_at, text) %>% 
  filter(!duplicated(text)) %>% 
  mutate(date = floor_date(created_at, unit="day")) %>% 
  filter(date == as.Date("2020-03-27")) %>% 
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid19_tweets_tidy
## # A tibble: 12,774,915 x 3
##    created_at          date                word       
##    <dttm>              <dttm>              <chr>      
##  1 2020-03-27 04:28:33 2020-03-27 00:00:00 fascinating
##  2 2020-03-27 04:28:33 2020-03-27 00:00:00 news       
##  3 2020-03-27 04:28:33 2020-03-27 00:00:00 in         
##  4 2020-03-27 04:28:33 2020-03-27 00:00:00 england    
##  5 2020-03-27 04:28:33 2020-03-27 00:00:00 uk         
##  6 2020-03-27 04:28:33 2020-03-27 00:00:00 firms      
##  7 2020-03-27 04:28:33 2020-03-27 00:00:00 and        
##  8 2020-03-27 04:28:33 2020-03-27 00:00:00 academics  
##  9 2020-03-27 04:28:33 2020-03-27 00:00:00 have       
## 10 2020-03-27 04:28:33 2020-03-27 00:00:00 also       
## # … with 12,774,905 more rows

And count the word frequencies:

covid19_tweets_tidy %>% 
  count(word, sort = TRUE)
## # A tibble: 814,531 x 2
##    word          n
##    <chr>     <int>
##  1 the      452109
##  2 to       360861
##  3 of       232595
##  4 and      219941
##  5 covid19  195903
##  6 in       187403
##  7 a        177265
##  8 #covid19 169430
##  9 is       151050
## 10 for      150215
## # … with 814,521 more rows

There are still four problems to be resolved:

  1. Stop words: We can use a lexicon of stop words provided by the stopwords package.
library(stopwords)
stopwords() # returns the vector of stop words
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"       "will"
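
By default stopwords() returns the English Snowball lexicon; other lexicons can be requested through the source argument (a sketch; the available sources depend on the installed stopwords version):

head(stopwords(language = "en", source = "smart")) # the larger SMART lexicon
length(stopwords(source = "smart"))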

The matching operator %in% returns a logical vector indicating, for each element of vector A, whether it is matched by an element of vector B: A %in% B

c("abc","bcd","123") %in% c("abc","efg","hlm")
## [1]  TRUE FALSE FALSE
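
Combining %in% with stopwords(), stop words can then be removed from a tidy table with a single filter() call (a minimal sketch on toy tokens):

tibble(word = c("the", "outbreak", "is", "global")) %>% 
  filter(!word %in% stopwords())
# keeps only "outbreak" and "global"
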
  2. Numbers: We can remove any token that does not contain an alphabetic character (see the sketch just below).
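
A toy vector of tokens (the same str_detect() test appears in the full pipeline at the end of this section):

tokens <- c("covid19", "2020", "100,000", "cases")
tokens[str_detect(tokens, "[a-z]")] # keeps only "covid19" and "cases"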

  3. HTML (Hypertext Markup Language) entities: There are some HTML entity references (&amp; = &, &lt; = <, &gt; = >, &quot; = ") that can be matched by the regex "&amp;|&lt;|&gt;|&quot;"

covid19_tweets_df %>% 
  select(created_at, text) %>% 
  slice(12) %>% 
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 16 x 2
##    created_at          word                   
##    <dttm>              <chr>                  
##  1 2020-03-27 04:19:13 the                    
##  2 2020-03-27 04:19:13 war                    
##  3 2020-03-27 04:19:13 against                
##  4 2020-03-27 04:19:13 #covid19               
##  5 2020-03-27 04:19:13 has                    
##  6 2020-03-27 04:19:13 to                     
##  7 2020-03-27 04:19:13 be                     
##  8 2020-03-27 04:19:13 won                    
##  9 2020-03-27 04:19:13 at                     
## 10 2020-03-27 04:19:13 home                   
## 11 2020-03-27 04:19:13 #stayhome              
## 12 2020-03-27 04:19:13 amp                    
## 13 2020-03-27 04:19:13 stay                   
## 14 2020-03-27 04:19:13 safe                   
## 15 2020-03-27 04:19:13 #coronavirusoutbreak   
## 16 2020-03-27 04:19:13 https://t.co/pbkonlwqey
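
Note the stray amp token in row 12: it is the remnant of an &amp; entity. Such entities can be replaced with a space before tokenizing (a minimal sketch):

str_replace_all("Stay home &amp; stay safe", "&amp;|&lt;|&gt;|&quot;", " ")
# the &amp; entity is replaced by a space
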
  4. Non-ASCII characters: these can be matched by the regex "[^[:ascii:]]+"
covid19_tweets_df %>% 
  select(created_at, text) %>% 
  slice(2) %>% 
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 6 x 2
##   created_at          word                   
##   <dttm>              <chr>                  
## 1 2020-03-27 04:28:33 https://t.co/6zhx6m6rwx
## 2 2020-03-27 04:28:33 corona                 
## 3 2020-03-27 04:28:33 virus                  
## 4 2020-03-27 04:28:33 rhapsody               
## 5 2020-03-27 04:28:33 🎧🎸🎼🎤               
## 6 2020-03-27 04:28:33 #covid19
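
The emoji run in row 5 survives tokenization as a token of its own; stripping non-ASCII characters before tokenizing avoids this (a sketch using the same regex as the pipeline below):

str_replace_all("corona virus rhapsody 🎧🎸🎼🎤 #covid19", "[^[:ascii:]]+", " ")
# the emoji block is replaced by a single space
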
covid19_tweets_tidy <- covid19_tweets_df %>% 
  select(created_at, text) %>% 
  filter(!duplicated(text)) %>% 
  mutate(date = floor_date(created_at, unit="day")) %>% 
  filter(date == as.Date("2020-03-27")) %>% 
  mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>% 
  mutate(text = str_replace_all(text, "&amp;|&lt;|&gt;|&quot;|RT", " ")) %>% 
  unnest_tweets(word, text) %>% 
  filter(!word %in% stopwords()) %>% 
  filter(str_detect(word, "[a-z]"))
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid19_tweets_tidy %>% count(word, sort=T)
## # A tibble: 768,709 x 2
##    word              n
##    <chr>         <int>
##  1 covid19      196574
##  2 #covid19     170114
##  3 #coronavirus 107914
##  4 people        44139
##  5 s             42049
##  6 can           40212
##  7 us            39279
##  8 cases         37156
##  9 now           37062
## 10 coronavirus   28365
## # … with 768,699 more rows

Let’s visualize the frequencies of words beginning with the letter “a”, using the wordcloud package.

library(wordcloud)
## Loading required package: RColorBrewer
set.seed(428) # set.seed() fixes the random seed so the same word cloud layout is reproduced on each run

covid19_tweets_tidy %>% 
  filter(str_detect(word, "^a[[:word:]]+")) %>% 
  count(word, sort=TRUE) %>%
  with(wordcloud(words = word, # with() evaluates the expression using the columns of the data frame
                 freq = n, 
                 max.words = 200, # Maximum number of words plotted
                 random.order = FALSE, # Most frequent words placed in the middle
                 rot.per = 0.2, # Proportion of words rotated in the plot
                 scale = c(3, 0.3), # Range of word sizes
                 colors = brewer.pal(8, "Dark2"))) # Use 8 colors from the "Dark2" palette