Week7-2: Tokenization using Tidytext

Learning Objectives

Understand the task of supervised machine learning, and learn about feature representation.
Learn how textual data are fed to machine learning algorithms.
Introduce tidy data principles and see how to make data tidy with functions from the magrittr and dplyr packages.
See how the tidytext package applies tidy data principles to text via the unnest_tokens() function.

Tidy text data

Here, tidy text is “a table with one-token-per-row”. A token is whatever unit of text is meaningful for our analysis: it could be a word, a word pair, a phrase, a sentence, etc.

This means that getting text data tidy is largely a matter of:

Deciding what the level of the analysis is going to be - what the “token” is.
Splitting the text into tokens, a process called tokenization.

Let’s take a look at tokenization in more detail.

Tokenization with `unnest_tokens()`

The tidytext package provides a very useful tool for tokenization: unnest_tokens().

The unnest_tokens() function performs tokenization by splitting each text line up into the required tokens, and creating a new data frame with one token per row, that is, tidy text data.
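To make the idea of a "token" concrete, here is a minimal sketch that tokenizes one made-up sentence pair at three different levels. The toy tibble and its column names (id, txt, token) are assumptions for illustration only; they are not part of the course data.

```r
library(tibble)
library(tidytext)

# Hypothetical toy data: one document containing two short sentences
toy <- tibble(id = 1, txt = "Vaccines save lives. Get vaccinated today.")

# One row per word
words <- unnest_tokens(toy, token, txt, token = "words")

# One row per word pair (bigram)
bigrams <- unnest_tokens(toy, token, txt, token = "ngrams", n = 2)

# One row per sentence
sentences <- unnest_tokens(toy, token, txt, token = "sentences")

nrow(words)
nrow(bigrams)
nrow(sentences)
```

The same text yields a different number of rows depending on the chosen unit, which is why the level of analysis has to be decided before tokenizing.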

Suppose we want to analyze the individual words in tweets about COVID-19 vaccines, including their hashtags. We first load the packages and the data, and then tokenize the text column:

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## √ ggplot2 3.3.3     √ purrr   0.3.4
## √ tibble  3.1.0     √ dplyr   1.0.4
## √ tidyr   1.1.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidytext)
load("cv_tweets.RData")
cv_tweets

## # A tibble: 114,299 x 91
##    user_id status_id   created_at          screen_name text             source  
##    <chr>   <chr>       <dttm>              <chr>       <chr>            <chr>   
##  1 1652541 1378336545~ 2021-04-03 13:20:15 Reuters     Ukraine approve~ True An~
##  2 1652541 1378823457~ 2021-04-04 21:35:04 Reuters     U.S. says 165 m~ True An~
##  3 1652541 1378848625~ 2021-04-04 23:15:04 Reuters     U.S. says 165 m~ True An~
##  4 1652541 1378700128~ 2021-04-04 13:25:00 Reuters     The U.S. has pu~ Twitter~
##  5 1652541 1378776935~ 2021-04-04 18:30:12 Reuters     U.S. says 165 m~ True An~
##  6 1652541 1378388094~ 2021-04-03 16:45:05 Reuters     Ukraine approve~ True An~
##  7 1652541 1378802069~ 2021-04-04 20:10:04 Reuters     U.S. says 165 m~ True An~
##  8 1652541 1378651093~ 2021-04-04 10:10:09 Reuters     China administe~ True An~
##  9 1652541 1378335309~ 2021-04-03 13:15:20 Reuters     Ukraine approve~ True An~
## 10 1652541 1378386857~ 2021-04-03 16:40:10 Reuters     Ukraine approve~ True An~
## # ... with 114,289 more rows, and 85 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>, date <dttm>

cv_tweets %>% 
  unnest_tokens(word, text, token="words")

## # A tibble: 3,140,167 x 91
##    user_id status_id    created_at          screen_name source  display_text_wi~
##    <chr>   <chr>        <dttm>              <chr>       <chr>              <dbl>
##  1 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  2 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  3 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  4 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  5 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  6 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  7 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  8 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
##  9 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
## 10 1652541 13783365459~ 2021-04-03 13:20:15 Reuters     True A~               94
## # ... with 3,140,157 more rows, and 85 more variables:
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>, date <dttm>, word <chr>

Note the arguments passed to unnest_tokens():

the original data frame containing the text (cv_tweets, supplied through the pipe)
the name of the variable to create for the tokens in the new tidy data frame (word)
the name of the variable that stores the text in the original data frame (text, i.e. the tweets are in cv_tweets$text)
the unit of tokenization (token="words")

Note also that the input column (text) is dropped from the result and replaced by word, which is why the output still has 91 columns.
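The arguments can also be written out by name, which may make the call easier to read. The sketch below uses a small made-up tibble (the values are assumptions for illustration) and additionally shows two optional arguments of unnest_tokens(), to_lower and drop, which control lowercasing and whether the input column is kept.

```r
library(tibble)
library(tidytext)

# Hypothetical toy data standing in for cv_tweets
toy <- tibble(id = 1:2, text = c("Get the Vaccine", "Stay Safe"))

# Default behaviour: tokens are lowercased and the input column is dropped
tidy_default <- unnest_tokens(toy, output = word, input = text, token = "words")

# Keep the original case and keep the input column alongside the tokens
tidy_keep <- unnest_tokens(toy, output = word, input = text, token = "words",
                           to_lower = FALSE, drop = FALSE)

tidy_default
tidy_keep
```

Keeping the input column can be handy while checking the tokenization, at the cost of repeating the full text on every row.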

Special feature of `unnest_tweets()`

A really nice feature of tidytext is that it can tokenize tweets with unnest_tweets(). This tokenizer takes Twitter-specific features into account when splitting the strings in a column into one token per row: it keeps the symbols @ and #, which carry meaning in handles and hashtags, and it keeps each URL as a single token.

tweets <- tibble(
   id = 1,
   txt = "@rOpenSci and #rstats see: https://cran.r-project.org"
)
tweets

## # A tibble: 1 x 2
##      id txt                                                  
##   <dbl> <chr>                                                
## 1     1 @rOpenSci and #rstats see: https://cran.r-project.org

tweets %>% 
  unnest_tokens(out, txt, token="words")

## # A tibble: 7 x 2
##      id out        
##   <dbl> <chr>      
## 1     1 ropensci   
## 2     1 and        
## 3     1 rstats     
## 4     1 see        
## 5     1 https      
## 6     1 cran.r     
## 7     1 project.org

# With token="words", unnest_tokens() strips the @ and # symbols from handles and hashtags and splits URLs into pieces.

tweets %>%
   unnest_tweets(out, txt)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

## # A tibble: 5 x 2
##      id out                       
##   <dbl> <chr>                     
## 1     1 @ropensci                 
## 2     1 and                       
## 3     1 #rstats                   
## 4     1 see                       
## 5     1 https://cran.r-project.org

# With unnest_tweets(), handles (@), hashtags (#), and URLs are kept as single tokens.
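In case unnest_tweets() is not available in your installed tidytext version, a similar result can be approximated with a regular expression. The following is only a rough sketch and not the tokenizer used above: the pattern is an assumption that matches URLs first and then words with an optional leading @ or #, and unlike the output above it does not lowercase the tokens.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

tweets <- tibble(
  id = 1,
  txt = "@rOpenSci and #rstats see: https://cran.r-project.org"
)

# Assumed pattern: a whole URL, or a word optionally prefixed by @ or #
tweet_pattern <- "https?://\\S+|[@#]?\\w+"

tokens <- tweets %>%
  mutate(out = str_extract_all(txt, tweet_pattern)) %>%  # list column of tokens
  select(-txt) %>%
  unnest(cols = out)                                     # one token per row

tokens
```

Real tweets contain many cases this simple pattern does not cover (emoji, punctuation inside hashtags, and so on), so it should be checked against a sample of the data before use.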

We can apply unnest_tweets() to cv_tweets to transform the full set of tweets into tidy text format. Before doing so, I want to keep only the created_at and text variables, remove any duplicated text, and keep only the tweets posted on March 31, 2021.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# floor_date function from the lubridate package takes a date-time object and rounds it down to the nearest boundary of the specified time unit.

cv_tweets %>% 
  mutate(date = floor_date(created_at, unit="days")) %>% 
  filter(date == as.Date("2021-03-31"))

## # A tibble: 22,204 x 91
##    user_id  status_id   created_at          screen_name  text            source 
##    <chr>    <chr>       <dttm>              <chr>        <chr>           <chr>  
##  1 2800824~ 1377409655~ 2021-03-31 23:57:07 Ez4u2say_Ja~ Pfizer study s~ Twitte~
##  2 2836042~ 1377380842~ 2021-03-31 22:02:38 AndyVermaut  Covid 19: Insi~ dlvr.it
##  3 2836042~ 1377266110~ 2021-03-31 14:26:43 AndyVermaut  Over 114 milli~ dlvr.it
##  4 2836042~ 1377340962~ 2021-03-31 19:24:09 AndyVermaut  Sinopharm, Sin~ dlvr.it
##  5 2836042~ 1377266847~ 2021-03-31 14:29:39 AndyVermaut  Ryan Reynolds ~ dlvr.it
##  6 2836042~ 1377281341~ 2021-03-31 15:27:15 AndyVermaut  Pfizer says CO~ dlvr.it
##  7 2836042~ 1377399459~ 2021-03-31 23:16:36 AndyVermaut  Washington sta~ dlvr.it
##  8 2836042~ 1377334770~ 2021-03-31 18:59:33 AndyVermaut  Thunder's Shai~ dlvr.it
##  9 2836042~ 1377341951~ 2021-03-31 19:28:05 AndyVermaut  Internet's 'Hi~ dlvr.it
## 10 2836042~ 1377329654~ 2021-03-31 18:39:13 AndyVermaut  Only Brits and~ dlvr.it
## # ... with 22,194 more rows, and 85 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>, date <dttm>
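To see what floor_date() does on its own, here is a small sketch with two made-up timestamps (assumed values, not taken from cv_tweets). It also shows as_date() from lubridate as one alternative way to obtain a calendar date that can be compared with as.Date().

```r
library(lubridate)

# Two hypothetical timestamps on either side of midnight (UTC)
stamps <- ymd_hms(c("2021-03-31 23:57:07", "2021-04-01 00:05:00"))

# Round each timestamp down to the start of its day (still a date-time)
days <- floor_date(stamps, unit = "day")

# Alternative: convert to a Date and compare with another Date
on_target <- as_date(stamps) == as.Date("2021-03-31")

days
on_target
```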

cv_tweets %>% 
  select(created_at, text) %>% 
  filter(!duplicated(text)) %>% 
  mutate(date = floor_date(created_at, unit="day")) %>% 
  filter(date == as.Date("2021-03-31"))

## # A tibble: 22,041 x 3
##    created_at          text                                  date               
##    <dttm>              <chr>                                 <dttm>             
##  1 2021-03-31 23:57:07 Pfizer study suggests COVID-19 vacci~ 2021-03-31 00:00:00
##  2 2021-03-31 22:02:38 Covid 19: Inside the BioNTech vaccin~ 2021-03-31 00:00:00
##  3 2021-03-31 14:26:43 Over 114 million COVID-19 vaccine do~ 2021-03-31 00:00:00
##  4 2021-03-31 19:24:09 Sinopharm, Sinovac COVID-19 vaccine ~ 2021-03-31 00:00:00
##  5 2021-03-31 14:29:39 Ryan Reynolds gets first dose of COV~ 2021-03-31 00:00:00
##  6 2021-03-31 15:27:15 Pfizer says COVID-19 vaccine 100% ef~ 2021-03-31 00:00:00
##  7 2021-03-31 23:16:36 Washington state to expand COVID-19 ~ 2021-03-31 00:00:00
##  8 2021-03-31 18:59:33 Thunder's Shai Gilgeous-Alexander ge~ 2021-03-31 00:00:00
##  9 2021-03-31 19:28:05 Internet's 'Hide the Pain Harold' ac~ 2021-03-31 00:00:00
## 10 2021-03-31 18:39:13 Only Brits and Danes Think Their Gov~ 2021-03-31 00:00:00
## # ... with 22,031 more rows
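The duplicate-removal step can be checked on a toy example. The sketch below uses made-up rows and compares filter(!duplicated(text)) with dplyr's distinct(), which is an alternative way to keep the first row for each unique text; both are shown here only to illustrate the behaviour.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data: the second row repeats the text of the first
toy <- tibble(
  id   = 1:3,
  text = c("vaccine news", "vaccine news", "other news")
)

# Approach used above: drop rows whose text has already appeared
dedup_filter <- toy %>% filter(!duplicated(text))

# Alternative: keep the first row for each distinct text
dedup_distinct <- toy %>% distinct(text, .keep_all = TRUE)

dedup_filter
dedup_distinct
```

Both keep rows 1 and 3, so the first occurrence of a repeated tweet survives and later copies are removed.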

# The as.Date() function converts a character representation of a date into an object of class "Date" representing a calendar date.

cv_tweets_tidy <- cv_tweets %>% 
  select(created_at, text) %>% 
  filter(!duplicated(text)) %>% 
  mutate(date = floor_date(created_at, unit="day")) %>% 
  filter(date == as.Date("2021-03-31")) %>% 
  unnest_tweets(word, text)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

cv_tweets_tidy

## # A tibble: 528,130 x 3
##    created_at          date                word      
##    <dttm>              <dttm>              <chr>     
##  1 2021-03-31 23:57:07 2021-03-31 00:00:00 pfizer    
##  2 2021-03-31 23:57:07 2021-03-31 00:00:00 study     
##  3 2021-03-31 23:57:07 2021-03-31 00:00:00 suggests  
##  4 2021-03-31 23:57:07 2021-03-31 00:00:00 covid19   
##  5 2021-03-31 23:57:07 2021-03-31 00:00:00 vaccine   
##  6 2021-03-31 23:57:07 2021-03-31 00:00:00 is        
##  7 2021-03-31 23:57:07 2021-03-31 00:00:00 safe      
##  8 2021-03-31 23:57:07 2021-03-31 00:00:00 protective
##  9 2021-03-31 23:57:07 2021-03-31 00:00:00 in        
## 10 2021-03-31 23:57:07 2021-03-31 00:00:00 younger   
## # ... with 528,120 more rows

Now we can count the word frequencies:

cv_tweets_tidy %>% 
  count(word, sort = TRUE) # count() counts the unique values of one or more variables

## # A tibble: 54,427 x 2
##    word        n
##    <chr>   <int>
##  1 vaccine 19945
##  2 the     18287
##  3 covid19 16541
##  4 to      13503
##  5 in       9634
##  6 of       9070
##  7 and      8903
##  8 a        7545
##  9 for      6685
## 10 is       5715
## # ... with 54,417 more rows
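As a sketch of what count() is doing, the example below applies it to a few made-up tokens and compares the result with the longer group_by(), summarise(), and arrange() sequence. The toy words are assumptions for illustration.

```r
library(tibble)
library(dplyr)

# Hypothetical toy tokens
toy <- tibble(word = c("vaccine", "the", "vaccine", "covid19", "the", "vaccine"))

# Short form used above
by_count <- toy %>% count(word, sort = TRUE)

# Equivalent long form: group, count rows per group, sort descending
by_hand <- toy %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

by_count
by_hand
```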

There are still several problems to be resolved:

Numbers: We can remove tokens that do not contain any letter of the alphabet.
HTML (Hypertext Markup Language) entities: The text contains some HTML entity references (&amp; = &, &lt; = <, &gt; = >, &quot; = ") that can be matched by the regex "&amp;|&lt;|&gt;|&quot;". In the example below, &amp; survives tokenization as the token amp.

cv_tweets %>% 
  slice(4) %>% 
  pull(text)

## [1] "The U.S. has put Johnson &amp; Johnson in charge of a plant that ruined 15 million doses of its COVID-19 vaccine and stopped British drugmaker AstraZeneca from using the facility, a senior health official said https://t.co/7h2482LLZv https://t.co/z3zdbUkzXB"

cv_tweets %>% 
  select(text) %>% 
  slice(4) %>% 
  unnest_tweets(word, text)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

## # A tibble: 37 x 1
##    word   
##    <chr>  
##  1 the    
##  2 us     
##  3 has    
##  4 put    
##  5 johnson
##  6 amp    
##  7 johnson
##  8 in     
##  9 charge 
## 10 of     
## # ... with 27 more rows
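The first two problems can be handled with stringr. The sketch below is one possible approach on made-up inputs: it removes HTML entity references from the raw text before tokenization and then drops tokens that contain no letter. The example strings and object names are assumptions for illustration.

```r
library(stringr)

# Hypothetical raw text containing HTML entity references
raw <- "Johnson &amp; Johnson ruined 15 million doses &lt;report&gt;"

# Remove the entity references, then collapse the leftover double spaces
cleaned <- str_squish(str_remove_all(raw, "&amp;|&lt;|&gt;|&quot;"))

# Hypothetical tokens, some of which contain no letter at all
tokens <- c("johnson", "15", "covid19", "+397", "5850")

# Keep only tokens with at least one lowercase letter
kept <- tokens[str_detect(tokens, "[a-z]")]

cleaned
kept
```

Note that covid19 is kept because it contains letters; only purely numeric tokens such as 15 or 5850 are dropped.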

Non-ASCII characters: These can be matched by the regex "[^[:ascii:]]+".

cv_tweets %>% 
  slice(12) %>% 
  pull(text)

## [1] "Covid-19 Vaccine Tracker Updates: 2021-04-04\n\n<U+2593><U+2593><U+2593><U+2593><U+2593><U+2591><U+2591><U+2591><U+2591><U+2591> 58.50% +3.97 Jersey\n<U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 4.01% +0.87 Jordan\n<U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 0.59% +0.12 Kazakhstan\n<U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 0.30% +0.13 Kenya\n<U+2593><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 14.16% +5.73 Kuwait"

cv_tweets %>% 
  select(text) %>% 
  slice(12) %>% 
  unnest_tweets(word, text)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

## # A tibble: 25 x 1
##    word      
##    <chr>     
##  1 covid19   
##  2 vaccine   
##  3 tracker   
##  4 updates   
##  5 20210404  
##  6 <U+2593><U+2593><U+2593><U+2593><U+2593><U+2591><U+2591><U+2591><U+2591><U+2591>
##  7 5850      
##  8 +397      
##  9 jersey    
## 10 <U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591>
## # ... with 15 more rows
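A minimal sketch of removing non-ASCII characters with stringr follows. The input string is made up to resemble the progress-bar tweet above, with the block characters written as Unicode escapes; it is an assumption for illustration.

```r
library(stringr)

# Hypothetical text with block characters (U+2593, U+2591) like the tweet above
raw <- "\u2593\u2593\u2591\u2591 58.50% +3.97 Jersey"

# Remove every run of non-ASCII characters, then tidy the whitespace
ascii_only <- str_squish(str_remove_all(raw, "[^[:ascii:]]+"))

ascii_only
```

This also removes accented letters and emoji, so it is worth checking whether any of those matter for the analysis before applying it to every tweet.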

Apostrophes: Some words include an apostrophe, such as 1) contractions (“do not” becomes “don’t”) and 2) possessive nouns, as in “my father’s house”. The tokenizer drops the apostrophe, so “Argentina's” becomes the token argentinas in the example below.

cv_tweets %>% 
  slice(11217) %>% 
  pull(text)

## [1] "Argentina's President Alberto Fernandez has COVID after vaccination - Axios .. Argentina's Pres. Fernandez announced today he's tested positive for COVID-19 <U+2014> after being vaccinated earlier this year with 2 doses of Russia’s Sputnik V coronavirus vaccine. https://t.co/XzPaWsL1Aj"

cv_tweets %>% 
  slice(11217) %>% 
  select(text) %>% 
  unnest_tweets(word, text)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

## # A tibble: 35 x 1
##    word       
##    <chr>      
##  1 argentinas 
##  2 president  
##  3 alberto    
##  4 fernandez  
##  5 has        
##  6 covid      
##  7 after      
##  8 vaccination
##  9 axios      
## 10 argentinas 
## # ... with 25 more rows
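One possible way to deal with possessives is to strip the trailing 's before tokenizing. The sketch below is an assumption about how this could be done, not the method used in these notes; it handles both the straight apostrophe and the curly one (U+2019) seen in the tweet above, and it also shortens contractions such as he's to he.

```r
library(stringr)

# Hypothetical text with a straight and a curly apostrophe
raw <- "Argentina's President and Russia\u2019s Sputnik"

# Remove an apostrophe followed by "s" at the end of a word
no_possessive <- str_remove_all(raw, "['\u2019]s\\b")

no_possessive
```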

Stop words: We can use a lexicon of stop words provided by the tidytext package.

stop_words

## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows
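Stop words are typically removed with an anti-join against this table. Here is a small sketch on made-up tokens (the toy tibble is an assumption for illustration); anti_join() keeps only the rows of the first table whose word does not appear in stop_words.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Hypothetical tokens: a mix of stop words and content words
toy <- tibble(word = c("the", "pfizer", "vaccine", "is", "in", "of"))

# Drop every token that appears in the stop word lexicon
content_words <- toy %>% anti_join(stop_words, by = "word")

content_words
```

Because stop_words combines several lexicons, the same word can appear more than once in it; an anti-join is unaffected by that, whereas an inner join would duplicate rows.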

stop_words$word # returns the vector of stop words

##    [1] "a"             "a's"           "able"          "about"        
##    [5] "above"         "according"     "accordingly"   "across"       
##    [9] "actually"      "after"         "afterwards"    "again"        
##   [13] "against"       "ain't"         "all"           "allow"        
##   [17] "allows"        "almost"        "alone"         "along"        
##   [21] "already"       "also"          "although"      "always"       
##   [25] "am"            "among"         "amongst"       "an"           
##   [29] "and"           "another"       "any"           "anybody"      
##   [33] "anyhow"        "anyone"        "anything"      "anyway"       
##   [37] "anyways"       "anywhere"      "apart"         "appear"       
##   [41] "appreciate"    "appropriate"   "are"           "aren't"       
##   [45] "around"        "as"            "aside"         "ask"          
##   [49] "asking"        "associated"    "at"            "available"    
##   [53] "away"          "awfully"       "b"             "be"           
##   [57] "became"        "because"       "become"        "becomes"      
##   [61] "becoming"      "been"          "before"        "beforehand"   
##   [65] "behind"        "being"         "believe"       "below"        
##   [69] "beside"        "besides"       "best"          "better"       
##   [73] "between"       "beyond"        "both"          "brief"        
##   [77] "but"           "by"            "c"             "c'mon"        
##   [81] "c's"           "came"          "can"           "can't"        
##   [85] "cannot"        "cant"          "cause"         "causes"       
##   [89] "certain"       "certainly"     "changes"       "clearly"      
##   [93] "co"            "com"           "come"          "comes"        
##   [97] "concerning"    "consequently"  "consider"      "considering"  
##  [101] "contain"       "containing"    "contains"      "corresponding"
##  [105] "could"         "couldn't"      "course"        "currently"    
##  [109] "d"             "definitely"    "described"     "despite"      
##  [113] "did"           "didn't"        "different"     "do"           
##  [117] "does"          "doesn't"       "doing"         "don't"        
##  [121] "done"          "down"          "downwards"     "during"       
##  [125] "e"             "each"          "edu"           "eg"           
##  [129] "eight"         "either"        "else"          "elsewhere"    
##  [133] "enough"        "entirely"      "especially"    "et"           
##  [137] "etc"           "even"          "ever"          "every"        
##  [141] "everybody"     "everyone"      "everything"    "everywhere"   
##  [145] "ex"            "exactly"       "example"       "except"       
##  [149] "f"             "far"           "few"           "fifth"        
##  [153] "first"         "five"          "followed"      "following"    
##  [157] "follows"       "for"           "former"        "formerly"     
##  [161] "forth"         "four"          "from"          "further"      
##  [165] "furthermore"   "g"             "get"           "gets"         
##  [169] "getting"       "given"         "gives"         "go"           
##  [173] "goes"          "going"         "gone"          "got"          
##  [177] "gotten"        "greetings"     "h"             "had"          
##  [181] "hadn't"        "happens"       "hardly"        "has"          
##  [185] "hasn't"        "have"          "haven't"       "having"       
##  [189] "he"            "he's"          "hello"         "help"         
##  [193] "hence"         "her"           "here"          "here's"       
##  [197] "hereafter"     "hereby"        "herein"        "hereupon"     
##  [201] "hers"          "herself"       "hi"            "him"          
##  [205] "himself"       "his"           "hither"        "hopefully"    
##  [209] "how"           "howbeit"       "however"       "i"            
##  [213] "i'd"           "i'll"          "i'm"           "i've"         
##  [217] "ie"            "if"            "ignored"       "immediate"    
##  [221] "in"            "inasmuch"      "inc"           "indeed"       
##  [225] "indicate"      "indicated"     "indicates"     "inner"        
##  [229] "insofar"       "instead"       "into"          "inward"       
##  [233] "is"            "isn't"         "it"            "it'd"         
##  [237] "it'll"         "it's"          "its"           "itself"       
##  [241] "j"             "just"          "k"             "keep"         
##  [245] "keeps"         "kept"          "know"          "knows"        
##  [249] "known"         "l"             "last"          "lately"       
##  [253] "later"         "latter"        "latterly"      "least"        
##  [257] "less"          "lest"          "let"           "let's"        
##  [261] "like"          "liked"         "likely"        "little"       
##  [265] "look"          "looking"       "looks"         "ltd"          
##  [269] "m"             "mainly"        "many"          "may"          
##  [273] "maybe"         "me"            "mean"          "meanwhile"    
##  [277] "merely"        "might"         "more"          "moreover"     
##  [281] "most"          "mostly"        "much"          "must"         
##  [285] "my"            "myself"        "n"             "name"         
##  [289] "namely"        "nd"            "near"          "nearly"       
##  [293] "necessary"     "need"          "needs"         "neither"      
##  [297] "never"         "nevertheless"  "new"           "next"         
##  [301] "nine"          "no"            "nobody"        "non"          
##  [305] "none"          "noone"         "nor"           "normally"     
##  [309] "not"           "nothing"       "novel"         "now"          
##  [313] "nowhere"       "o"             "obviously"     "of"           
##  [317] "off"           "often"         "oh"            "ok"           
##  [321] "okay"          "old"           "on"            "once"         
##  [325] "one"           "ones"          "only"          "onto"         
##  [329] "or"            "other"         "others"        "otherwise"    
##  [333] "ought"         "our"           "ours"          "ourselves"    
##  [337] "out"           "outside"       "over"          "overall"      
##  [341] "own"           "p"             "particular"    "particularly" 
##  [345] "per"           "perhaps"       "placed"        "please"       
##  [349] "plus"          "possible"      "presumably"    "probably"     
##  [353] "provides"      "q"             "que"           "quite"        
##  [357] "qv"            "r"             "rather"        "rd"           
##  [361] "re"            "really"        "reasonably"    "regarding"    
##  [365] "regardless"    "regards"       "relatively"    "respectively" 
##  [369] "right"         "s"             "said"          "same"         
##  [373] "saw"           "say"           "saying"        "says"         
##  [377] "second"        "secondly"      "see"           "seeing"       
##  [381] "seem"          "seemed"        "seeming"       "seems"        
##  [385] "seen"          "self"          "selves"        "sensible"     
##  [389] "sent"          "serious"       "seriously"     "seven"        
##  [393] "several"       "shall"         "she"           "should"       
##  [397] "shouldn't"     "since"         "six"           "so"           
##  [401] "some"          "somebody"      "somehow"       "someone"      
##  [405] "something"     "sometime"      "sometimes"     "somewhat"     
##  [409] "somewhere"     "soon"          "sorry"         "specified"    
##  [413] "specify"       "specifying"    "still"         "sub"          
##  [417] "such"          "sup"           "sure"          "t"            
##  [421] "t's"           "take"          "taken"         "tell"         
##  [425] "tends"         "th"            "than"          "thank"        
##  [429] "thanks"        "thanx"         "that"          "that's"       
##  [433] "thats"         "the"           "their"         "theirs"       
##  [437] "them"          "themselves"    "then"          "thence"       
##  [441] "there"         "there's"       "thereafter"    "thereby"      
##  [445] "therefore"     "therein"       "theres"        "thereupon"    
##  [449] "these"         "they"          "they'd"        "they'll"      
##  [453] "they're"       "they've"       "think"         "third"        
##  [457] "this"          "thorough"      "thoroughly"    "those"        
##  [461] "though"        "three"         "through"       "throughout"   
##  [465] "thru"          "thus"          "to"            "together"     
##  [469] "too"           "took"          "toward"        "towards"      
##  [473] "tried"         "tries"         "truly"         "try"          
##  [477] "trying"        "twice"         "two"           "u"            
##  [481] "un"            "under"         "unfortunately" "unless"       
##  [485] "unlikely"      "until"         "unto"          "up"           
##  [489] "upon"          "us"            "use"           "used"         
##  [493] "useful"        "uses"          "using"         "usually"      
##  [497] "uucp"          "v"             "value"         "various"      
##  [501] "very"          "via"           "viz"           "vs"           
##  [505] "w"             "want"          "wants"         "was"          
##  [509] "wasn't"        "way"           "we"            "we'd"         
##  [513] "we'll"         "we're"         "we've"         "welcome"      
##  [517] "well"          "went"          "were"          "weren't"      
##  [521] "what"          "what's"        "whatever"      "when"         
##  [525] "whence"        "whenever"      "where"         "where's"      
##  [529] "whereafter"    "whereas"       "whereby"       "wherein"      
##  [533] "whereupon"     "wherever"      "whether"       "which"        
##  [537] "while"         "whither"       "who"           "who's"        
##  [541] "whoever"       "whole"         "whom"          "whose"        
##  [545] "why"           "will"          "willing"       "wish"         
##  [549] "with"          "within"        "without"       "won't"        
##  [553] "wonder"        "would"         "would"         "wouldn't"     
##  [557] "x"             "y"             "yes"           "yet"          
##  [561] "you"           "you'd"         "you'll"        "you're"       
##  [565] "you've"        "your"          "yours"         "yourself"     
##  [569] "yourselves"    "z"             "zero"          "i"            
##  [573] "me"            "my"            "myself"        "we"           
##  [577] "our"           "ours"          "ourselves"     "you"          
##  [581] "your"          "yours"         "yourself"      "yourselves"   
##  [585] "he"            "him"           "his"           "himself"      
##  [589] "she"           "her"           "hers"          "herself"      
##  [593] "it"            "its"           "itself"        "they"         
##  [597] "them"          "their"         "theirs"        "themselves"   
##  [601] "what"          "which"         "who"           "whom"         
##  [605] "this"          "that"          "these"         "those"        
##  [609] "am"            "is"            "are"           "was"          
##  [613] "were"          "be"            "been"          "being"        
##  [617] "have"          "has"           "had"           "having"       
##  [621] "do"            "does"          "did"           "doing"        
##  [625] "would"         "should"        "could"         "ought"        
##  [629] "i'm"           "you're"        "he's"          "she's"        
##  [633] "it's"          "we're"         "they're"       "i've"         
##  [637] "you've"        "we've"         "they've"       "i'd"          
##  [641] "you'd"         "he'd"          "she'd"         "we'd"         
##  [645] "they'd"        "i'll"          "you'll"        "he'll"        
##  [649] "she'll"        "we'll"         "they'll"       "isn't"        
##  [653] "aren't"        "wasn't"        "weren't"       "hasn't"       
##  [657] "haven't"       "hadn't"        "doesn't"       "don't"        
##  [661] "didn't"        "won't"         "wouldn't"      "shan't"       
##  [665] "shouldn't"     "can't"         "cannot"        "couldn't"     
##  [669] "mustn't"       "let's"         "that's"        "who's"        
##  [673] "what's"        "here's"        "there's"       "when's"       
##  [677] "where's"       "why's"         "how's"         "a"            
##  [681] "an"            "the"           "and"           "but"          
##  [685] "if"            "or"            "because"       "as"           
##  [689] "until"         "while"         "of"            "at"           
##  [693] "by"            "for"           "with"          "about"        
##  [697] "against"       "between"       "into"          "through"      
##  [701] "during"        "before"        "after"         "above"        
##  [705] "below"         "to"            "from"          "up"           
##  [709] "down"          "in"            "out"           "on"           
##  [713] "off"           "over"          "under"         "again"        
##  [717] "further"       "then"          "once"          "here"         
##  [721] "there"         "when"          "where"         "why"          
##  [725] "how"           "all"           "any"           "both"         
##  [729] "each"          "few"           "more"          "most"         
##  [733] "other"         "some"          "such"          "no"           
##  [737] "nor"           "not"           "only"          "own"          
##  [741] "same"          "so"            "than"          "too"          
##  [745] "very"          "a"             "about"         "above"        
##  [749] "across"        "after"         "again"         "against"      
##  [753] "all"           "almost"        "alone"         "along"        
##  [757] "already"       "also"          "although"      "always"       
##  [761] "among"         "an"            "and"           "another"      
##  [765] "any"           "anybody"       "anyone"        "anything"     
##  [769] "anywhere"      "are"           "area"          "areas"        
##  [773] "around"        "as"            "ask"           "asked"        
##  [777] "asking"        "asks"          "at"            "away"         
##  [781] "back"          "backed"        "backing"       "backs"        
##  [785] "be"            "became"        "because"       "become"       
##  [789] "becomes"       "been"          "before"        "began"        
##  [793] "behind"        "being"         "beings"        "best"         
##  [797] "better"        "between"       "big"           "both"         
##  [801] "but"           "by"            "came"          "can"          
##  [805] "cannot"        "case"          "cases"         "certain"      
##  [809] "certainly"     "clear"         "clearly"       "come"         
##  [813] "could"         "did"           "differ"        "different"    
##  [817] "differently"   "do"            "does"          "done"         
##  [821] "down"          "down"          "downed"        "downing"      
##  [825] "downs"         "during"        "each"          "early"        
##  [829] "either"        "end"           "ended"         "ending"       
##  [833] "ends"          "enough"        "even"          "evenly"       
##  [837] "ever"          "every"         "everybody"     "everyone"     
##  [841] "everything"    "everywhere"    "face"          "faces"        
##  [845] "fact"          "facts"         "far"           "felt"         
##  [849] "few"           "find"          "finds"         "first"        
##  [853] "for"           "four"          "from"          "full"         
##  [857] "fully"         "further"       "furthered"     "furthering"   
##  [861] "furthers"      "gave"          "general"       "generally"    
##  [865] "get"           "gets"          "give"          "given"        
##  [869] "gives"         "go"            "going"         "good"         
##  [873] "goods"         "got"           "great"         "greater"      
##  [877] "greatest"      "group"         "grouped"       "grouping"     
##  [881] "groups"        "had"           "has"           "have"         
##  [885] "having"        "he"            "her"           "here"         
##  [889] "herself"       "high"          "high"          "high"         
##  [893] "higher"        "highest"       "him"           "himself"      
##  [897] "his"           "how"           "however"       "i"            
##  [901] "if"            "important"     "in"            "interest"     
##  [905] "interested"    "interesting"   "interests"     "into"         
##  [909] "is"            "it"            "its"           "itself"       
##  [913] "just"          "keep"          "keeps"         "kind"         
##  [917] "knew"          "know"          "known"         "knows"        
##  [921] "large"         "largely"       "last"          "later"        
##  [925] "latest"        "least"         "less"          "let"          
##  [929] "lets"          "like"          "likely"        "long"         
##  [933] "longer"        "longest"       "made"          "make"         
##  [937] "making"        "man"           "many"          "may"          
##  [941] "me"            "member"        "members"       "men"          
##  [945] "might"         "more"          "most"          "mostly"       
##  [949] "mr"            "mrs"           "much"          "must"         
##  [953] "my"            "myself"        "necessary"     "need"         
##  [957] "needed"        "needing"       "needs"         "never"        
##  [961] "new"           "new"           "newer"         "newest"       
##  [965] "next"          "no"            "nobody"        "non"          
##  [969] "noone"         "not"           "nothing"       "now"          
##  [973] "nowhere"       "number"        "numbers"       "of"           
##  [977] "off"           "often"         "old"           "older"        
##  [981] "oldest"        "on"            "once"          "one"          
##  [985] "only"          "open"          "opened"        "opening"      
##  [989] "opens"         "or"            "order"         "ordered"      
##  [993] "ordering"      "orders"        "other"         "others"       
##  [997] "our"           "out"           "over"          "part"         
## [1001] "parted"        "parting"       "parts"         "per"          
## [1005] "perhaps"       "place"         "places"        "point"        
## [1009] "pointed"       "pointing"      "points"        "possible"     
## [1013] "present"       "presented"     "presenting"    "presents"     
## [1017] "problem"       "problems"      "put"           "puts"         
## [1021] "quite"         "rather"        "really"        "right"        
## [1025] "right"         "room"          "rooms"         "said"         
## [1029] "same"          "saw"           "say"           "says"         
## [1033] "second"        "seconds"       "see"           "seem"         
## [1037] "seemed"        "seeming"       "seems"         "sees"         
## [1041] "several"       "shall"         "she"           "should"       
## [1045] "show"          "showed"        "showing"       "shows"        
## [1049] "side"          "sides"         "since"         "small"        
## [1053] "smaller"       "smallest"      "some"          "somebody"     
## [1057] "someone"       "something"     "somewhere"     "state"        
## [1061] "states"        "still"         "still"         "such"         
## [1065] "sure"          "take"          "taken"         "than"         
## [1069] "that"          "the"           "their"         "them"         
## [1073] "then"          "there"         "therefore"     "these"        
## [1077] "they"          "thing"         "things"        "think"        
## [1081] "thinks"        "this"          "those"         "though"       
## [1085] "thought"       "thoughts"      "three"         "through"      
## [1089] "thus"          "to"            "today"         "together"     
## [1093] "too"           "took"          "toward"        "turn"         
## [1097] "turned"        "turning"       "turns"         "two"          
## [1101] "under"         "until"         "up"            "upon"         
## [1105] "us"            "use"           "used"          "uses"         
## [1109] "very"          "want"          "wanted"        "wanting"      
## [1113] "wants"         "was"           "way"           "ways"         
## [1117] "we"            "well"          "wells"         "went"         
## [1121] "were"          "what"          "when"          "where"        
## [1125] "whether"       "which"         "while"         "who"          
## [1129] "whole"         "whose"         "why"           "will"         
## [1133] "with"          "within"        "without"       "work"         
## [1137] "worked"        "working"       "works"         "would"        
## [1141] "year"          "years"         "yet"           "you"          
## [1145] "young"         "younger"       "youngest"      "your"         
## [1149] "yours"
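
The listing above repeats some words (for example, "we" and "you" appear more than once) because the stop_words data frame in tidytext stacks three lexicons, labelled SMART, snowball, and onix in its lexicon column. A minimal sketch for checking this, and for keeping a single lexicon if the combined list removes too much; the name snowball_only is only illustrative:

```r
library(dplyr)
library(tidytext)

# Number of stop words contributed by each lexicon
stop_words %>% count(lexicon)

# Keep only one lexicon (here the shorter snowball list)
snowball_only <- stop_words %>% filter(lexicon == "snowball")

# Distinct words once duplicates across lexicons are dropped
n_distinct(stop_words$word)
```

Filtering with snowball_only$word instead of stop_words$word would remove fewer words; which list suits the analysis is a judgment call.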

The matching operator %in% returns a logical vector with one element for each element of vector A, which is TRUE when that element also appears in vector B: A %in% B

c("abc","bcd","123") %in% c("abc","efg","hlm")

## [1]  TRUE FALSE FALSE
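
To drop matches instead of keeping them, negate the result with !. Because %in% binds more tightly than !, the expression !A %in% B is read as !(A %in% B). A small base R sketch with made-up vectors (words and stops are illustrative names, not objects from the tweets data):

```r
words <- c("the", "vaccine", "is", "effective")
stops <- c("the", "is", "a")

# TRUE where an element of `words` also appears in `stops`
is_stop <- words %in% stops
is_stop

# Negating the match keeps only the words that are not in `stops`
kept <- words[!words %in% stops]
kept
```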

cv_tweets %>% 
  slice(11217) %>% 
  select(text) %>% 
  unnest_tweets(word, text) %>% 
  filter(!word %in% stop_words$word)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

## # A tibble: 24 x 1
##    word       
##    <chr>      
##  1 argentinas 
##  2 president  
##  3 alberto    
##  4 fernandez  
##  5 covid      
##  6 vaccination
##  7 axios      
##  8 argentinas 
##  9 pres       
## 10 fernandez  
## # ... with 14 more rows
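
An equivalent way to remove stop words, often seen in tidytext code, is anti_join() from dplyr, which keeps the rows of the first table that have no match in the second. A minimal sketch on toy data (tokens and my_stops are made-up tables standing in for the tokenized tweet and stop_words):

```r
library(dplyr)

tokens   <- tibble(word = c("the", "president", "of", "argentina", "the"))
my_stops <- tibble(word = c("the", "of"))

# Approach used above: keep rows whose word is not in the stop list
kept_filter <- tokens %>% filter(!word %in% my_stops$word)

# Alternative: drop every row that has a match in the stop list
kept_join <- tokens %>% anti_join(my_stops, by = "word")

kept_join
```

Both approaches leave the same words here; filter() with %in% works on a plain character vector, while anti_join() expects a table with a matching column name.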

Text preprocessing steps

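Before counting words, we remove the following from the tweets:

Non-ASCII characters (such as emoji), together with a # or @ directly in front of them
HTML entities (&amp;, &lt;, &gt;, &quot;) and the retweet marker RT
Stop words
Stop words written without their apostrophe (for example, dont for don't)
Tokens that contain no letters, such as numbers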

library(lubridate) # floor_date() comes from lubridate, which library(tidyverse) 1.3.0 does not attach

cv_tweets_tidy <- cv_tweets %>% 
  select(created_at, text) %>% 
  filter(!duplicated(text)) %>% # Keep the first copy of each tweet text
  mutate(date = floor_date(created_at, unit="days")) %>% 
  filter(date == as.Date("2021-03-31")) %>% 
  mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>% # Non-ASCII characters
  mutate(text = str_replace_all(text, "&amp;|&lt;|&gt;|&quot;|RT", " ")) %>% # HTML entities and the RT marker
  unnest_tweets(word, text) %>% 
  filter(!word %in% stop_words$word) %>% # Stop words
  filter(!word %in% str_remove_all(stop_words$word, "'")) %>% # Stop words without their apostrophe
  filter(str_detect(word, "[a-z]")) # Keep tokens with at least one letter (drops numbers)

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
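
To see what the individual cleaning steps do, they can be tried on short made-up strings. The sketch below applies the same stringr patterns as the pipeline above to invented inputs; str_squish() is added only to collapse leftover spaces for display, and the \\bRT\\b variant is a suggestion rather than part of the original pipeline.

```r
library(stringr)

# HTML entities and the retweet marker are replaced by spaces
raw     <- "RT Pfizer &amp; Moderna"
cleaned <- str_squish(str_replace_all(raw, "&amp;|&lt;|&gt;|&quot;|RT", " "))
cleaned

# The bare pattern RT also matches inside upper-case words;
# word boundaries (\\b) restrict it to the standalone marker
loose  <- str_squish(str_replace_all("ALERT RT", "RT", " "))
strict <- str_squish(str_replace_all("ALERT RT", "\\bRT\\b", " "))

# Stop words with the apostrophe removed
no_apos <- str_remove_all(c("don't", "isn't"), "'")

# TRUE for tokens that contain at least one lower-case letter
has_letter <- str_detect(c("covid19", "2021", "95", "pfizer"), "[a-z]")
has_letter
```

Because the RT replacement runs before the text is lower-cased, only upper-case occurrences are affected, but an all-caps word containing RT would still be cut unless the pattern is anchored.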

cv_tweets_tidy %>% count(word, sort = TRUE)

## # A tibble: 50,038 x 2
##    word          n
##    <chr>     <int>
##  1 vaccine   19996
##  2 covid19   16546
##  3 pfizer     3509
##  4 effective  2288
##  5 #covid19   1808
##  6 people     1618
##  7 covid      1444
##  8 vaccines   1429
##  9 doses      1360
## 10 children   1327
## # ... with 50,028 more rows
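
The sorted count() call is shorthand for grouping by word, counting the rows in each group, and arranging the result by that count in descending order. A minimal sketch on a made-up token table (toy is illustrative, not part of the tweets data):

```r
library(dplyr)

toy <- tibble(word = c("vaccine", "covid19", "vaccine",
                       "pfizer", "vaccine", "covid19"))

# Short form
short_form <- toy %>% count(word, sort = TRUE)

# Long form spelling out the same steps
long_form <- toy %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

short_form
```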

Let’s visualize the frequencies of words beginning with the letter “a”, using `wordcloud`

library(wordcloud)

## Loading required package: RColorBrewer

set.seed(415) # Fix the random seed so the word cloud layout is the same on every run

cv_tweets_tidy %>% 
  filter(str_detect(word, "^a[[:word:]]+")) %>% 
  count(word, sort=TRUE) %>%
  with(wordcloud(words = word, # with() evaluates the call using the columns of the piped data frame
                 freq = n, 
                 max.words = 200, # Maximum number of words plotted
                 random.order = FALSE, # Plot words in decreasing frequency, so frequent words sit near the center
                 rot.per = 0.2, # Proportion of words rotated by 90 degrees
                 scale = c(3, 0.3), # Largest and smallest word sizes
                 colors = brewer.pal(8, "Dark2"))) # Eight colors from the RColorBrewer "Dark2" palette
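
The call to with() is needed because wordcloud() takes plain vectors rather than a data frame; with() evaluates an expression using the data frame's columns as variables. A base R sketch with a made-up frequency table (freq_tbl is an illustrative name):

```r
freq_tbl <- data.frame(word = c("access", "adults", "age"),
                       n = c(12, 7, 3),
                       stringsAsFactors = FALSE)

# Inside with(), `word` and `n` refer to the columns of freq_tbl
top_word <- with(freq_tbl, word[which.max(n)])
total    <- with(freq_tbl, sum(n))

top_word
total
```

Piping a data frame into with() passes it as the first argument, so df %>% with(expr) is the same as with(df, expr).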

Week7-2: Tokenization using Tidytext

Shin Lee

4/15/2021
