Understand the task of supervised machine learning, and learn about feature representation.
Learn about the way in which textual data are applied to machine learning algorithms.
Introduce tidy data principles and see how to make data tidy with functions from the magrittr and dplyr packages.
See how the tidytext package applies tidy data principles to text via the unnest_tokens() function.
Here, tidy text is “a table with one-token-per-row”. A token is whatever unit of text is meaningful for our analysis: it could be a word, a word pair, a phrase, a sentence, etc.
This means that the process of getting text data tidy is largely a matter of:
Deciding on the level of the analysis, that is, what the “token” is.
Splitting the text into tokens, a process called tokenization.
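As a rough illustration of these two steps, the base R sketch below splits the same text at two different token levels. The data frame `docs`, its contents, and the helper `tidy_split()` are made up for this example; they are not part of tidytext.

```r
# Toy input (made-up text, not taken from cv_tweets)
docs <- data.frame(
  id = c(1, 2),
  text = c("Vaccines are safe. Get your dose today.", "Doses arrive Monday."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each text on a regex and return one token per row
tidy_split <- function(df, pattern) {
  pieces <- strsplit(df$text, pattern)
  data.frame(
    id = rep(df$id, lengths(pieces)),  # repeat each id once per token
    token = unlist(pieces),
    stringsAsFactors = FALSE
  )
}

# Step 1: choose the token level; step 2: split accordingly
word_tokens <- tidy_split(docs, "[^[:alnum:]]+")          # token = word
sentence_tokens <- tidy_split(docs, "[.!?][[:space:]]*")  # token = sentence

word_tokens
sentence_tokens
```

The same text yields ten rows at the word level and three rows at the sentence level, which is why the choice of token has to come first.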
Let’s take a look at tokenization in more detail.
unnest_tokens()
The tidytext package provides a very useful tool for tokenization: unnest_tokens().
The unnest_tokens() function performs tokenization by splitting each line of text into the required tokens and creating a new data frame with one token per row, that is, tidy text data.
Suppose we want to analyze the individual words, including hashtags, in tweets about COVID-19 vaccines. We do this as follows:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.3 √ purrr 0.3.4
## √ tibble 3.1.0 √ dplyr 1.0.4
## √ tidyr 1.1.2 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
load("cv_tweets.RData")
cv_tweets
## # A tibble: 114,299 x 91
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 1652541 1378336545~ 2021-04-03 13:20:15 Reuters Ukraine approve~ True An~
## 2 1652541 1378823457~ 2021-04-04 21:35:04 Reuters U.S. says 165 m~ True An~
## 3 1652541 1378848625~ 2021-04-04 23:15:04 Reuters U.S. says 165 m~ True An~
## 4 1652541 1378700128~ 2021-04-04 13:25:00 Reuters The U.S. has pu~ Twitter~
## 5 1652541 1378776935~ 2021-04-04 18:30:12 Reuters U.S. says 165 m~ True An~
## 6 1652541 1378388094~ 2021-04-03 16:45:05 Reuters Ukraine approve~ True An~
## 7 1652541 1378802069~ 2021-04-04 20:10:04 Reuters U.S. says 165 m~ True An~
## 8 1652541 1378651093~ 2021-04-04 10:10:09 Reuters China administe~ True An~
## 9 1652541 1378335309~ 2021-04-03 13:15:20 Reuters Ukraine approve~ True An~
## 10 1652541 1378386857~ 2021-04-03 16:40:10 Reuters Ukraine approve~ True An~
## # ... with 114,289 more rows, and 85 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>,
## # ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## # lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## # quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>, date <dttm>
cv_tweets %>%
  unnest_tokens(word, text, token = "words")
## # A tibble: 3,140,167 x 91
## user_id status_id created_at screen_name source display_text_wi~
## <chr> <chr> <dttm> <chr> <chr> <dbl>
## 1 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 2 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 3 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 4 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 5 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 6 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 7 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 8 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 9 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## 10 1652541 13783365459~ 2021-04-03 13:20:15 Reuters True A~ 94
## # ... with 3,140,157 more rows, and 85 more variables:
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>,
## # ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## # lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## # quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>, date <dttm>, word <chr>
Note the arguments passed to unnest_tokens():
* the original data frame containing the text (cv_tweets)
* the variable name we want to create for the tokens in the new tidy data frame (word)
* the variable name where the text is stored in the original data frame (text, i.e. the tweets are in cv_tweets$text)
* the unit of tokenization (token="words")
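To see how the `token` argument changes the result, here is a small sketch on a made-up one-row tibble (the object `toy` and its text are assumptions for illustration). It uses `token = "ngrams"` with `n = 2`, so each row holds a pair of adjacent words instead of a single word.

```r
library(tibble)
library(tidytext)

# Made-up example text; not part of cv_tweets
toy <- tibble(id = 1, text = "Vaccines protect younger teens")

# Same argument order as before: data, output column, input column, token unit
bigrams <- unnest_tokens(toy, bigram, text, token = "ngrams", n = 2)
bigrams
```

As with `token = "words"`, the text is lowercased by default and the input column is replaced by the new token column.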
unnest_tweets()
A really nice feature of tidytext is that it can tokenize tweets. This means that Twitter-specific features are taken into account when splitting the strings in a column into one token per row. For example, when analyzing tweets, we can keep meaningful symbols such as @ and # attached to their tokens, while URLs are kept as a single string.
tweets <- tibble(
  id = 1,
  txt = "@rOpenSci and #rstats see: https://cran.r-project.org"
)
tweets
## # A tibble: 1 x 2
## id txt
## <dbl> <chr>
## 1 1 @rOpenSci and #rstats see: https://cran.r-project.org
tweets %>%
  unnest_tokens(out, txt, token = "words")
## # A tibble: 7 x 2
## id out
## <dbl> <chr>
## 1 1 ropensci
## 2 1 and
## 3 1 rstats
## 4 1 see
## 5 1 https
## 6 1 cran.r
## 7 1 project.org
# With token = "words", unnest_tokens() does not preserve Twitter handles (@), hashtags (#), or URLs.
tweets %>%
  unnest_tweets(out, txt)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 5 x 2
## id out
## <dbl> <chr>
## 1 1 @ropensci
## 2 1 and
## 3 1 #rstats
## 4 1 see
## 5 1 https://cran.r-project.org
# With unnest_tweets(), such Twitter-specific features are taken into account.
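For readers who want to see the idea behind tweet-aware tokenization, or whose installed tidytext version does not offer unnest_tweets(), the following base R sketch reproduces the behaviour on the same string with a single regular expression. The helper name `tokenize_tweet()` and the pattern are assumptions made for this illustration; this is not how tidytext implements it.

```r
# Hypothetical helper: keep URLs, @handles and #hashtags as single tokens
tokenize_tweet <- function(x) {
  # Try a URL first; otherwise take a word with an optional leading @ or #
  pattern <- "https?://\\S+|[@#]?[[:alnum:]_]+"
  tokens <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  is_url <- grepl("^https?://", tokens)
  # Lowercase everything except URLs, whose paths can be case-sensitive
  ifelse(is_url, tokens, tolower(tokens))
}

tokenize_tweet("@rOpenSci and #rstats see: https://cran.r-project.org")
```

The order of the alternatives matters: because the URL branch is tried first, the address is never broken at the dots or slashes.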
We can apply the unnest_tweets() function to cv_tweets, so we are now in a position to transform the full set of tweets into tidy text format. Before doing so, I want to keep only the created_at and text variables, remove any duplicated text, and focus on tweets posted on March 31, 2021.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# floor_date() from the lubridate package takes a date-time object and rounds it down to the nearest boundary of the specified time unit.
cv_tweets %>%
  mutate(date = floor_date(created_at, unit = "day")) %>%
  filter(date == as.Date("2021-03-31"))
## # A tibble: 22,204 x 91
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 2800824~ 1377409655~ 2021-03-31 23:57:07 Ez4u2say_Ja~ Pfizer study s~ Twitte~
## 2 2836042~ 1377380842~ 2021-03-31 22:02:38 AndyVermaut Covid 19: Insi~ dlvr.it
## 3 2836042~ 1377266110~ 2021-03-31 14:26:43 AndyVermaut Over 114 milli~ dlvr.it
## 4 2836042~ 1377340962~ 2021-03-31 19:24:09 AndyVermaut Sinopharm, Sin~ dlvr.it
## 5 2836042~ 1377266847~ 2021-03-31 14:29:39 AndyVermaut Ryan Reynolds ~ dlvr.it
## 6 2836042~ 1377281341~ 2021-03-31 15:27:15 AndyVermaut Pfizer says CO~ dlvr.it
## 7 2836042~ 1377399459~ 2021-03-31 23:16:36 AndyVermaut Washington sta~ dlvr.it
## 8 2836042~ 1377334770~ 2021-03-31 18:59:33 AndyVermaut Thunder's Shai~ dlvr.it
## 9 2836042~ 1377341951~ 2021-03-31 19:28:05 AndyVermaut Internet's 'Hi~ dlvr.it
## 10 2836042~ 1377329654~ 2021-03-31 18:39:13 AndyVermaut Only Brits and~ dlvr.it
## # ... with 22,194 more rows, and 85 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, quote_count <int>,
## # reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## # urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## # media_t.co <list>, media_expanded_url <list>, media_type <list>,
## # ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## # lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## # quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>, date <dttm>
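The de-duplication and date steps can be checked on a tiny made-up data frame before running them on all the tweets. In this sketch the object `toy` and its values are invented for illustration, and both sides of the comparison are converted to class "Date" explicitly, which is a cautious variation on the filter used above rather than a correction of it.

```r
library(lubridate)

# Made-up rows: two tweets with identical text on March 31, one on April 3
toy <- data.frame(
  created_at = as.POSIXct(c("2021-03-31 23:57:07", "2021-03-31 14:26:43",
                            "2021-04-03 13:20:15"), tz = "UTC"),
  text = c("Pfizer study", "Pfizer study", "Ukraine approves"),
  stringsAsFactors = FALSE
)

# duplicated() flags repeats of an earlier value, so ! keeps first occurrences
toy <- toy[!duplicated(toy$text), ]

# Round each timestamp down to midnight of the same day
toy$date <- floor_date(toy$created_at, unit = "day")

# Compare calendar dates by converting both sides to class "Date"
kept <- toy[as.Date(toy$date) == as.Date("2021-03-31"), ]
kept
```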
cv_tweets %>%
  select(created_at, text) %>%
  filter(!duplicated(text)) %>%
  mutate(date = floor_date(created_at, unit = "day")) %>%
  filter(date == as.Date("2021-03-31"))
## # A tibble: 22,041 x 3
## created_at text date
## <dttm> <chr> <dttm>
## 1 2021-03-31 23:57:07 Pfizer study suggests COVID-19 vacci~ 2021-03-31 00:00:00
## 2 2021-03-31 22:02:38 Covid 19: Inside the BioNTech vaccin~ 2021-03-31 00:00:00
## 3 2021-03-31 14:26:43 Over 114 million COVID-19 vaccine do~ 2021-03-31 00:00:00
## 4 2021-03-31 19:24:09 Sinopharm, Sinovac COVID-19 vaccine ~ 2021-03-31 00:00:00
## 5 2021-03-31 14:29:39 Ryan Reynolds gets first dose of COV~ 2021-03-31 00:00:00
## 6 2021-03-31 15:27:15 Pfizer says COVID-19 vaccine 100% ef~ 2021-03-31 00:00:00
## 7 2021-03-31 23:16:36 Washington state to expand COVID-19 ~ 2021-03-31 00:00:00
## 8 2021-03-31 18:59:33 Thunder's Shai Gilgeous-Alexander ge~ 2021-03-31 00:00:00
## 9 2021-03-31 19:28:05 Internet's 'Hide the Pain Harold' ac~ 2021-03-31 00:00:00
## 10 2021-03-31 18:39:13 Only Brits and Danes Think Their Gov~ 2021-03-31 00:00:00
## # ... with 22,031 more rows
# as.Date() converts a character representation of a date into an object of class "Date", which represents a calendar date.
cv_tweets_tidy <- cv_tweets %>%
  select(created_at, text) %>%
  filter(!duplicated(text)) %>%
  mutate(date = floor_date(created_at, unit = "day")) %>%
  filter(date == as.Date("2021-03-31")) %>%
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
cv_tweets_tidy
## # A tibble: 528,130 x 3
## created_at date word
## <dttm> <dttm> <chr>
## 1 2021-03-31 23:57:07 2021-03-31 00:00:00 pfizer
## 2 2021-03-31 23:57:07 2021-03-31 00:00:00 study
## 3 2021-03-31 23:57:07 2021-03-31 00:00:00 suggests
## 4 2021-03-31 23:57:07 2021-03-31 00:00:00 covid19
## 5 2021-03-31 23:57:07 2021-03-31 00:00:00 vaccine
## 6 2021-03-31 23:57:07 2021-03-31 00:00:00 is
## 7 2021-03-31 23:57:07 2021-03-31 00:00:00 safe
## 8 2021-03-31 23:57:07 2021-03-31 00:00:00 protective
## 9 2021-03-31 23:57:07 2021-03-31 00:00:00 in
## 10 2021-03-31 23:57:07 2021-03-31 00:00:00 younger
## # ... with 528,120 more rows
Next, we count the word frequencies:
cv_tweets_tidy %>%
  count(word, sort = TRUE) # count() tallies the rows for each unique value of one or more variables
## # A tibble: 54,427 x 2
## word n
## <chr> <int>
## 1 vaccine 19945
## 2 the 18287
## 3 covid19 16541
## 4 to 13503
## 5 in 9634
## 6 of 9070
## 7 and 8903
## 8 a 7545
## 9 for 6685
## 10 is 5715
## # ... with 54,417 more rows
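The behaviour of count() with `sort = TRUE` is easy to verify on a handful of tokens. The tibble `toy` below is made up for illustration.

```r
library(dplyr)

# Six made-up tokens in a one-token-per-row tibble
toy <- tibble(word = c("vaccine", "the", "vaccine", "covid19", "the", "vaccine"))

# One row per distinct word, with its frequency in column n, most frequent first
freq <- count(toy, word, sort = TRUE)
freq
```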
There are still several problems to be resolved:
Numbers: We can remove any token that does not contain a letter of the alphabet.
HTML (Hypertext Markup Language) entities: The text contains some HTML entity references (&amp; = &, &lt; = <, &gt; = >, &quot; = ") that can be matched by the following regex: "&amp;|&lt;|&gt;|&quot;"
cv_tweets %>%
  slice(4) %>%
  pull(text)
## [1] "The U.S. has put Johnson &amp; Johnson in charge of a plant that ruined 15 million doses of its COVID-19 vaccine and stopped British drugmaker AstraZeneca from using the facility, a senior health official said https://t.co/7h2482LLZv https://t.co/z3zdbUkzXB"
cv_tweets %>%
  select(text) %>%
  slice(4) %>%
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 37 x 1
## word
## <chr>
## 1 the
## 2 us
## 3 has
## 4 put
## 5 johnson
## 6 amp
## 7 johnson
## 8 in
## 9 charge
## 10 of
## # ... with 27 more rows
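The stray "amp" token in the output above comes from an HTML entity in the raw text. A minimal base R sketch of the fix, using a made-up string `x`, replaces the entities with spaces before tokenizing:

```r
# Made-up text containing two kinds of HTML entity references
x <- "Johnson &amp; Johnson said &quot;doses&quot; were ruined"

# Replace each entity with a space so neighbouring words stay separate
cleaned <- gsub("&amp;|&lt;|&gt;|&quot;", " ", x)

# Simple whitespace tokenization, for illustration only
tokens <- strsplit(tolower(trimws(cleaned)), "[[:space:]]+")[[1]]
tokens
```

Replacing with a space rather than an empty string avoids gluing adjacent words together.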
Non-ASCII characters: These can be matched by the following regex: "[^[:ascii:]]+"
cv_tweets %>%
  slice(12) %>%
  pull(text)
## [1] "Covid-19 Vaccine Tracker Updates: 2021-04-04\n\n<U+2593><U+2593><U+2593><U+2593><U+2593><U+2591><U+2591><U+2591><U+2591><U+2591> 58.50% +3.97 Jersey\n<U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 4.01% +0.87 Jordan\n<U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 0.59% +0.12 Kazakhstan\n<U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 0.30% +0.13 Kenya\n<U+2593><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591> 14.16% +5.73 Kuwait"
cv_tweets %>%
  select(text) %>%
  slice(12) %>%
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 25 x 1
## word
## <chr>
## 1 covid19
## 2 vaccine
## 3 tracker
## 4 updates
## 5 20210404
## 6 <U+2593><U+2593><U+2593><U+2593><U+2593><U+2591><U+2591><U+2591><U+2591><U+2591>
## 7 5850
## 8 +397
## 9 jersey
## 10 <U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591><U+2591>
## # ... with 15 more rows
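The block-character tokens in the output above carry no meaning for a word-frequency analysis. As a sketch of how they can be stripped, the base R code below removes every run of non-ASCII characters from a made-up string; it expresses the same idea with an explicit character range instead of the `[:ascii:]` class.

```r
# Made-up text with block characters (U+2593, U+2591) and an em dash (U+2014)
x <- "Vaccine Tracker: \u2593\u2593\u2591\u2591 58.50% Jersey \u2014 updated"

# Replace each run of characters outside the ASCII range with a space
cleaned <- gsub("[^\\x01-\\x7F]+", " ", x, perl = TRUE)

# Collapse the leftover runs of whitespace
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))
cleaned
```

Note that this also drops legitimate non-ASCII letters (for example accented names), so it suits English-only analyses like this one.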
cv_tweets %>%
  slice(11217) %>%
  pull(text)
## [1] "Argentina's President Alberto Fernandez has COVID after vaccination - Axios .. Argentina's Pres. Fernandez announced today he's tested positive for COVID-19 <U+2014> after being vaccinated earlier this year with 2 doses of Russia’s Sputnik V coronavirus vaccine. https://t.co/XzPaWsL1Aj"
cv_tweets %>%
  slice(11217) %>%
  select(text) %>%
  unnest_tweets(word, text)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 35 x 1
## word
## <chr>
## 1 argentinas
## 2 president
## 3 alberto
## 4 fernandez
## 5 has
## 6 covid
## 7 after
## 8 vaccination
## 9 axios
## 10 argentinas
## # ... with 25 more rows
Stop words: The tidytext package provides a data frame of common stop words, stop_words.
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
stop_words$word # returns the vector of stop words
## [1] "a" "a's" "able" "about"
## [5] "above" "according" "accordingly" "across"
## [9] "actually" "after" "afterwards" "again"
## [13] "against" "ain't" "all" "allow"
## [17] "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always"
## [25] "am" "among" "amongst" "an"
## [29] "and" "another" "any" "anybody"
## [33] "anyhow" "anyone" "anything" "anyway"
## [37] "anyways" "anywhere" "apart" "appear"
## [41] "appreciate" "appropriate" "are" "aren't"
## [45] "around" "as" "aside" "ask"
## [49] "asking" "associated" "at" "available"
## [53] "away" "awfully" "b" "be"
## [57] "became" "because" "become" "becomes"
## [61] "becoming" "been" "before" "beforehand"
## [65] "behind" "being" "believe" "below"
## [69] "beside" "besides" "best" "better"
## [73] "between" "beyond" "both" "brief"
## [77] "but" "by" "c" "c'mon"
## [81] "c's" "came" "can" "can't"
## [85] "cannot" "cant" "cause" "causes"
## [89] "certain" "certainly" "changes" "clearly"
## [93] "co" "com" "come" "comes"
## [97] "concerning" "consequently" "consider" "considering"
## [101] "contain" "containing" "contains" "corresponding"
## [105] "could" "couldn't" "course" "currently"
## [109] "d" "definitely" "described" "despite"
## [113] "did" "didn't" "different" "do"
## [117] "does" "doesn't" "doing" "don't"
## [121] "done" "down" "downwards" "during"
## [125] "e" "each" "edu" "eg"
## [129] "eight" "either" "else" "elsewhere"
## [133] "enough" "entirely" "especially" "et"
## [137] "etc" "even" "ever" "every"
## [141] "everybody" "everyone" "everything" "everywhere"
## [145] "ex" "exactly" "example" "except"
## [149] "f" "far" "few" "fifth"
## [153] "first" "five" "followed" "following"
## [157] "follows" "for" "former" "formerly"
## [161] "forth" "four" "from" "further"
## [165] "furthermore" "g" "get" "gets"
## [169] "getting" "given" "gives" "go"
## [173] "goes" "going" "gone" "got"
## [177] "gotten" "greetings" "h" "had"
## [181] "hadn't" "happens" "hardly" "has"
## [185] "hasn't" "have" "haven't" "having"
## [189] "he" "he's" "hello" "help"
## [193] "hence" "her" "here" "here's"
## [197] "hereafter" "hereby" "herein" "hereupon"
## [201] "hers" "herself" "hi" "him"
## [205] "himself" "his" "hither" "hopefully"
## [209] "how" "howbeit" "however" "i"
## [213] "i'd" "i'll" "i'm" "i've"
## [217] "ie" "if" "ignored" "immediate"
## [221] "in" "inasmuch" "inc" "indeed"
## [225] "indicate" "indicated" "indicates" "inner"
## [229] "insofar" "instead" "into" "inward"
## [233] "is" "isn't" "it" "it'd"
## [237] "it'll" "it's" "its" "itself"
## [241] "j" "just" "k" "keep"
## [245] "keeps" "kept" "know" "knows"
## [249] "known" "l" "last" "lately"
## [253] "later" "latter" "latterly" "least"
## [257] "less" "lest" "let" "let's"
## [261] "like" "liked" "likely" "little"
## [265] "look" "looking" "looks" "ltd"
## [269] "m" "mainly" "many" "may"
## [273] "maybe" "me" "mean" "meanwhile"
## [277] "merely" "might" "more" "moreover"
## [281] "most" "mostly" "much" "must"
## [285] "my" "myself" "n" "name"
## [289] "namely" "nd" "near" "nearly"
## [293] "necessary" "need" "needs" "neither"
## [297] "never" "nevertheless" "new" "next"
## [301] "nine" "no" "nobody" "non"
## [305] "none" "noone" "nor" "normally"
## [309] "not" "nothing" "novel" "now"
## [313] "nowhere" "o" "obviously" "of"
## [317] "off" "often" "oh" "ok"
## [321] "okay" "old" "on" "once"
## [325] "one" "ones" "only" "onto"
## [329] "or" "other" "others" "otherwise"
## [333] "ought" "our" "ours" "ourselves"
## [337] "out" "outside" "over" "overall"
## [341] "own" "p" "particular" "particularly"
## [345] "per" "perhaps" "placed" "please"
## [349] "plus" "possible" "presumably" "probably"
## [353] "provides" "q" "que" "quite"
## [357] "qv" "r" "rather" "rd"
## [361] "re" "really" "reasonably" "regarding"
## [365] "regardless" "regards" "relatively" "respectively"
## [369] "right" "s" "said" "same"
## [373] "saw" "say" "saying" "says"
## [377] "second" "secondly" "see" "seeing"
## [381] "seem" "seemed" "seeming" "seems"
## [385] "seen" "self" "selves" "sensible"
## [389] "sent" "serious" "seriously" "seven"
## [393] "several" "shall" "she" "should"
## [397] "shouldn't" "since" "six" "so"
## [401] "some" "somebody" "somehow" "someone"
## [405] "something" "sometime" "sometimes" "somewhat"
## [409] "somewhere" "soon" "sorry" "specified"
## [413] "specify" "specifying" "still" "sub"
## [417] "such" "sup" "sure" "t"
## [421] "t's" "take" "taken" "tell"
## [425] "tends" "th" "than" "thank"
## [429] "thanks" "thanx" "that" "that's"
## [433] "thats" "the" "their" "theirs"
## [437] "them" "themselves" "then" "thence"
## [441] "there" "there's" "thereafter" "thereby"
## [445] "therefore" "therein" "theres" "thereupon"
## [449] "these" "they" "they'd" "they'll"
## [453] "they're" "they've" "think" "third"
## [457] "this" "thorough" "thoroughly" "those"
## [461] "though" "three" "through" "throughout"
## [465] "thru" "thus" "to" "together"
## [469] "too" "took" "toward" "towards"
## [473] "tried" "tries" "truly" "try"
## [477] "trying" "twice" "two" "u"
## [481] "un" "under" "unfortunately" "unless"
## [485] "unlikely" "until" "unto" "up"
## [489] "upon" "us" "use" "used"
## [493] "useful" "uses" "using" "usually"
## [497] "uucp" "v" "value" "various"
## [501] "very" "via" "viz" "vs"
## [505] "w" "want" "wants" "was"
## [509] "wasn't" "way" "we" "we'd"
## [513] "we'll" "we're" "we've" "welcome"
## [517] "well" "went" "were" "weren't"
## [521] "what" "what's" "whatever" "when"
## [525] "whence" "whenever" "where" "where's"
## [529] "whereafter" "whereas" "whereby" "wherein"
## [533] "whereupon" "wherever" "whether" "which"
## [537] "while" "whither" "who" "who's"
## [541] "whoever" "whole" "whom" "whose"
## [545] "why" "will" "willing" "wish"
## [549] "with" "within" "without" "won't"
## [553] "wonder" "would" "would" "wouldn't"
## [557] "x" "y" "yes" "yet"
## [561] "you" "you'd" "you'll" "you're"
## [565] "you've" "your" "yours" "yourself"
## [569] "yourselves" "z" "zero" "i"
## [573] "me" "my" "myself" "we"
## [577] "our" "ours" "ourselves" "you"
## [581] "your" "yours" "yourself" "yourselves"
## [585] "he" "him" "his" "himself"
## [589] "she" "her" "hers" "herself"
## [593] "it" "its" "itself" "they"
## [597] "them" "their" "theirs" "themselves"
## [601] "what" "which" "who" "whom"
## [605] "this" "that" "these" "those"
## [609] "am" "is" "are" "was"
## [613] "were" "be" "been" "being"
## [617] "have" "has" "had" "having"
## [621] "do" "does" "did" "doing"
## [625] "would" "should" "could" "ought"
## [629] "i'm" "you're" "he's" "she's"
## [633] "it's" "we're" "they're" "i've"
## [637] "you've" "we've" "they've" "i'd"
## [641] "you'd" "he'd" "she'd" "we'd"
## [645] "they'd" "i'll" "you'll" "he'll"
## [649] "she'll" "we'll" "they'll" "isn't"
## [653] "aren't" "wasn't" "weren't" "hasn't"
## [657] "haven't" "hadn't" "doesn't" "don't"
## [661] "didn't" "won't" "wouldn't" "shan't"
## [665] "shouldn't" "can't" "cannot" "couldn't"
## [669] "mustn't" "let's" "that's" "who's"
## [673] "what's" "here's" "there's" "when's"
## [677] "where's" "why's" "how's" "a"
## [681] "an" "the" "and" "but"
## [685] "if" "or" "because" "as"
## [689] "until" "while" "of" "at"
## [693] "by" "for" "with" "about"
## [697] "against" "between" "into" "through"
## [701] "during" "before" "after" "above"
## [705] "below" "to" "from" "up"
## [709] "down" "in" "out" "on"
## [713] "off" "over" "under" "again"
## [717] "further" "then" "once" "here"
## [721] "there" "when" "where" "why"
## [725] "how" "all" "any" "both"
## [729] "each" "few" "more" "most"
## [733] "other" "some" "such" "no"
## [737] "nor" "not" "only" "own"
## [741] "same" "so" "than" "too"
## [745] "very" "a" "about" "above"
## [749] "across" "after" "again" "against"
## [753] "all" "almost" "alone" "along"
## [757] "already" "also" "although" "always"
## [761] "among" "an" "and" "another"
## [765] "any" "anybody" "anyone" "anything"
## [769] "anywhere" "are" "area" "areas"
## [773] "around" "as" "ask" "asked"
## [777] "asking" "asks" "at" "away"
## [781] "back" "backed" "backing" "backs"
## [785] "be" "became" "because" "become"
## [789] "becomes" "been" "before" "began"
## [793] "behind" "being" "beings" "best"
## [797] "better" "between" "big" "both"
## [801] "but" "by" "came" "can"
## [805] "cannot" "case" "cases" "certain"
## [809] "certainly" "clear" "clearly" "come"
## [813] "could" "did" "differ" "different"
## [817] "differently" "do" "does" "done"
## [821] "down" "down" "downed" "downing"
## [825] "downs" "during" "each" "early"
## [829] "either" "end" "ended" "ending"
## [833] "ends" "enough" "even" "evenly"
## [837] "ever" "every" "everybody" "everyone"
## [841] "everything" "everywhere" "face" "faces"
## [845] "fact" "facts" "far" "felt"
## [849] "few" "find" "finds" "first"
## [853] "for" "four" "from" "full"
## [857] "fully" "further" "furthered" "furthering"
## [861] "furthers" "gave" "general" "generally"
## [865] "get" "gets" "give" "given"
## [869] "gives" "go" "going" "good"
## [873] "goods" "got" "great" "greater"
## [877] "greatest" "group" "grouped" "grouping"
## [881] "groups" "had" "has" "have"
## [885] "having" "he" "her" "here"
## [889] "herself" "high" "high" "high"
## [893] "higher" "highest" "him" "himself"
## [897] "his" "how" "however" "i"
## [901] "if" "important" "in" "interest"
## [905] "interested" "interesting" "interests" "into"
## [909] "is" "it" "its" "itself"
## [913] "just" "keep" "keeps" "kind"
## [917] "knew" "know" "known" "knows"
## [921] "large" "largely" "last" "later"
## [925] "latest" "least" "less" "let"
## [929] "lets" "like" "likely" "long"
## [933] "longer" "longest" "made" "make"
## [937] "making" "man" "many" "may"
## [941] "me" "member" "members" "men"
## [945] "might" "more" "most" "mostly"
## [949] "mr" "mrs" "much" "must"
## [953] "my" "myself" "necessary" "need"
## [957] "needed" "needing" "needs" "never"
## [961] "new" "new" "newer" "newest"
## [965] "next" "no" "nobody" "non"
## [969] "noone" "not" "nothing" "now"
## [973] "nowhere" "number" "numbers" "of"
## [977] "off" "often" "old" "older"
## [981] "oldest" "on" "once" "one"
## [985] "only" "open" "opened" "opening"
## [989] "opens" "or" "order" "ordered"
## [993] "ordering" "orders" "other" "others"
## [997] "our" "out" "over" "part"
## [1001] "parted" "parting" "parts" "per"
## [1005] "perhaps" "place" "places" "point"
## [1009] "pointed" "pointing" "points" "possible"
## [1013] "present" "presented" "presenting" "presents"
## [1017] "problem" "problems" "put" "puts"
## [1021] "quite" "rather" "really" "right"
## [1025] "right" "room" "rooms" "said"
## [1029] "same" "saw" "say" "says"
## [1033] "second" "seconds" "see" "seem"
## [1037] "seemed" "seeming" "seems" "sees"
## [1041] "several" "shall" "she" "should"
## [1045] "show" "showed" "showing" "shows"
## [1049] "side" "sides" "since" "small"
## [1053] "smaller" "smallest" "some" "somebody"
## [1057] "someone" "something" "somewhere" "state"
## [1061] "states" "still" "still" "such"
## [1065] "sure" "take" "taken" "than"
## [1069] "that" "the" "their" "them"
## [1073] "then" "there" "therefore" "these"
## [1077] "they" "thing" "things" "think"
## [1081] "thinks" "this" "those" "though"
## [1085] "thought" "thoughts" "three" "through"
## [1089] "thus" "to" "today" "together"
## [1093] "too" "took" "toward" "turn"
## [1097] "turned" "turning" "turns" "two"
## [1101] "under" "until" "up" "upon"
## [1105] "us" "use" "used" "uses"
## [1109] "very" "want" "wanted" "wanting"
## [1113] "wants" "was" "way" "ways"
## [1117] "we" "well" "wells" "went"
## [1121] "were" "what" "when" "where"
## [1125] "whether" "which" "while" "who"
## [1129] "whole" "whose" "why" "will"
## [1133] "with" "within" "without" "work"
## [1137] "worked" "working" "works" "would"
## [1141] "year" "years" "yet" "you"
## [1145] "young" "younger" "youngest" "your"
## [1149] "yours"
The matching operator %in% returns a logical vector indicating which elements of the vector A are matched by an element of the vector B: A %in% B
c("abc","bcd","123") %in% c("abc","efg","hlm")
## [1] TRUE FALSE FALSE
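To drop stop words we need the opposite of a match, which is written with a leading `!`. The short sketch below uses made-up vectors `words` and `stops` to show that `!words %in% stops` is read by R as `!(words %in% stops)`, because `%in%` binds more tightly than `!`.

```r
words <- c("has", "covid", "after", "vaccination")
stops <- c("has", "after", "the")   # made-up mini stop list

is_stop <- words %in% stops         # TRUE for "has" and "after"
kept <- words[!words %in% stops]    # same as words[!(words %in% stops)]

is_stop
kept
```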
cv_tweets %>%
  slice(11217) %>%
  select(text) %>%
  unnest_tweets(word, text) %>%
  filter(!word %in% stop_words$word)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 24 x 1
## word
## <chr>
## 1 argentinas
## 2 president
## 3 alberto
## 4 fernandez
## 5 covid
## 6 vaccination
## 7 axios
## 8 argentinas
## 9 pres
## 10 fernandez
## # ... with 14 more rows
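The tokenized output shows "argentinas" for "Argentina's", that is, apostrophes are dropped during tokenization. A stop word such as "he's" would therefore appear as "hes" and slip past a stop list that still contains the apostrophe. The sketch below illustrates this with made-up vectors; `stops` is a hand-picked subset, not the full stop_words list.

```r
stops <- c("he's", "after", "has")                  # made-up subset of a stop list
tokens <- c("hes", "tested", "positive", "after")   # apostrophes already stripped

# Plain stop-word filter: "hes" survives because only "he's" is listed
first_pass <- tokens[!tokens %in% stops]

# Also filter against the stop words with their apostrophes removed
stops_no_apos <- gsub("'", "", stops)
second_pass <- first_pass[!first_pass %in% stops_no_apos]

first_pass
second_pass
```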
cv_tweets_tidy <- cv_tweets %>%
  select(created_at, text) %>%
  filter(!duplicated(text)) %>%
  mutate(date = floor_date(created_at, unit = "day")) %>%
  filter(date == as.Date("2021-03-31")) %>%
  mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>% # Non-ASCII characters
  mutate(text = str_replace_all(text, "&amp;|&lt;|&gt;|&quot;|RT", " ")) %>% # HTML entities and the retweet marker "RT"
  unnest_tweets(word, text) %>%
  filter(!word %in% stop_words$word) %>% # Stop words
  filter(!word %in% str_remove_all(stop_words$word, "'")) %>% # Stop words with apostrophes stripped
  filter(str_detect(word, "[a-z]")) # Drop tokens that contain no letters (numbers)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
cv_tweets_tidy %>% count(word, sort = TRUE)
## # A tibble: 50,038 x 2
## word n
## <chr> <int>
## 1 vaccine 19996
## 2 covid19 16546
## 3 pfizer 3509
## 4 effective 2288
## 5 #covid19 1808
## 6 people 1618
## 7 covid 1444
## 8 vaccines 1429
## 9 doses 1360
## 10 children 1327
## # ... with 50,028 more rows
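The last filter in the pipeline keeps only tokens that contain at least one lowercase letter, which removes purely numeric tokens while keeping mixed ones such as "covid19". A base R sketch on made-up tokens:

```r
tokens <- c("covid19", "20210404", "5850", "+397", "jersey")

# TRUE where the token contains at least one lowercase letter
has_letter <- grepl("[a-z]", tokens)

tokens[has_letter]
```

Because the tokens were lowercased during tokenization, testing for `[a-z]` alone is enough here.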
wordcloud
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(415) # Fix the random seed so that the words are placed in the same positions on every run
cv_tweets_tidy %>%
  filter(str_detect(word, "^a[[:word:]]+")) %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(words = word, # with() evaluates an expression using the columns of a data frame
                 freq = n,
                 max.words = 200, # Maximum number of words plotted
                 random.order = FALSE, # Most frequent words placed in the middle
                 rot.per = 0.2, # Proportion of words rotated by 90 degrees
                 scale = c(3, 0.3), # Range of word sizes
                 colors = brewer.pal(8, "Dark2"))) # Retrieve 8 colors from the "Dark2" palette
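The role of set.seed() can be illustrated without drawing anything. The word cloud layout relies on random numbers, and resetting the seed to the same value replays the same random sequence, as this small base R sketch with runif() shows (the variable names are made up):

```r
set.seed(415)
first_draw <- runif(3)

set.seed(415)           # same seed, so the same random numbers follow
second_draw <- runif(3)

identical(first_draw, second_draw)
```

Any integer works as a seed; what matters is that it is set immediately before the code that uses randomness.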