The pipe operator %>% is very useful in R.
Let's say we have two functions, f: A -> B and g: B -> C. We can chain these two functions together by taking the output of one function and passing it as the input to the next. In short, “chaining” means that we pass an intermediate result on to the next function. Here, “g follows f”: g(f(x)).
In R, we can pass the result of one command on to the next with the pipe operator. As we’ve seen, R code often contains lots of parentheses, ( and ), especially when the code is complex: functions are nested inside other functions, which are nested inside yet other functions, and so on. This makes R code hard to read and understand. Here’s where %>% comes to the rescue.
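As a minimal sketch of the idea, the two toy functions below (f and g are made up for illustration) are applied once as a nested call and once with the pipe. The pipe itself comes from the magrittr package, and both forms should give the same result.

```r
library(magrittr)  # provides %>%; dplyr and stringr re-export it

# Two toy functions, made up for illustration
f <- function(x) x + 1   # f: add one
g <- function(x) x * 2   # g: double

nested <- g(f(3))             # read inside-out: f first, then g
piped  <- 3 %>% f() %>% g()   # read left to right: f first, then g

identical(nested, piped)      # TRUE
```

The piped version lists the steps in the order they are carried out, which is the main readability gain.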
Here’s an example using a short passage of text:
library(stringr)
bts_sent <- "The group's name, BTS, is an acronym for the Korean expression Bangtan Sonyeondan (Hangul: 방탄소년단; Hanja: 防彈少年團), literally meaning 'Bulletproof Boy Scouts'. The name was conceptualized with the thought that BTS would block out stereotypes, criticisms, and expectations that aim on adolescents like bullets and protect the values and ideals of today’s adolescents. In Japan, they are known as Bōdan Shōnendan (防弾少年団), which translates similarly. On July 2017, BTS announced that in addition to being known as Bangtan Sonyeondan or Bulletproof Boy Scouts, the acronym would also stand for 'Beyond The Scene' as part of their new brand identity. This extended their name to mean 'growing youth BTS who is going beyond the realities they are facing, and going forward."
# To tokenize the text and count word frequencies in descending order, we used the following code:
sort(table(unlist(str_extract_all(tolower(bts_sent), boundary("word")))), decreasing = TRUE)
##
## the and bts as name
## 8 4 4 3 3
## that acronym adolescents are bangtan
## 3 2 2 2 2
## beyond boy bulletproof for going
## 2 2 2 2 2
## in is known of on
## 2 2 2 2 2
## scouts sonyeondan their they to
## 2 2 2 2 2
## would 少年 2017 addition aim
## 2 2 1 1 1
## also an announced being block
## 1 1 1 1 1
## bōdan brand bullets conceptualized criticisms
## 1 1 1 1 1
## expectations expression extended facing forward
## 1 1 1 1 1
## group's growing hangul hanja ideals
## 1 1 1 1 1
## identity japan july korean like
## 1 1 1 1 1
## literally mean meaning new or
## 1 1 1 1 1
## out part protect realities scene
## 1 1 1 1 1
## shōnendan similarly stand stereotypes this
## 1 1 1 1 1
## thought today’s translates values was
## 1 1 1 1 1
## which who with youth 방탄소년단
## 1 1 1 1 1
## 団 團 防弾 防彈
## 1 1 1 1
But with the help of %>%, we can rewrite the above code as follows:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bts_table <- bts_sent %>%
str_to_lower() %>%
str_extract_all(boundary("word")) %>%
unlist() %>%
table() %>%
sort(decreasing = TRUE)
bts_table
## .
## the and bts as name
## 8 4 4 3 3
## that acronym adolescents are bangtan
## 3 2 2 2 2
## beyond boy bulletproof for going
## 2 2 2 2 2
## in is known of on
## 2 2 2 2 2
## scouts sonyeondan their they to
## 2 2 2 2 2
## would 少年 2017 addition aim
## 2 2 1 1 1
## also an announced being block
## 1 1 1 1 1
## bōdan brand bullets conceptualized criticisms
## 1 1 1 1 1
## expectations expression extended facing forward
## 1 1 1 1 1
## group's growing hangul hanja ideals
## 1 1 1 1 1
## identity japan july korean like
## 1 1 1 1 1
## literally mean meaning new or
## 1 1 1 1 1
## out part protect realities scene
## 1 1 1 1 1
## shōnendan similarly stand stereotypes this
## 1 1 1 1 1
## thought today’s translates values was
## 1 1 1 1 1
## which who with youth 방탄소년단
## 1 1 1 1 1
## 団 團 防弾 防彈
## 1 1 1 1
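The rewrite works because x %>% f(y) is evaluated as f(x, y): the left-hand side becomes the first argument of the call on the right. The sketch below illustrates this on a made-up vector (words is hypothetical, not part of the data above).

```r
library(magrittr)

# A made-up vector of tokens, for illustration only
words <- c("bts", "the", "bts", "name", "the", "bts")

# Nested form: decreasing = TRUE ends up far away from sort()
nested <- sort(table(words), decreasing = TRUE)

# Piped form: each left-hand side becomes the first argument on the right
piped <- words %>%
  table() %>%
  sort(decreasing = TRUE)

# Same words and same counts in both versions
all(names(nested) == names(piped))
all(as.vector(nested) == as.vector(piped))
```

The two tables differ only in their variable label: inside the pipe the input is referred to as a dot, which is why the piped output is headed by a "." line.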
The dplyr and tidytext packages make tokenization much easier by means of pipe operators.
The only function we need for tokenization is unnest_tokens():
library(tidytext)
bts_sents <- bts_sent %>%
str_extract_all(boundary("sentence")) %>%
unlist()
bts_sents
## [1] "The group's name, BTS, is an acronym for the Korean expression Bangtan Sonyeondan (Hangul: 방탄소년단; Hanja: 防彈少年團), literally meaning 'Bulletproof Boy Scouts'. "
## [2] "The name was conceptualized with the thought that BTS would block out stereotypes, criticisms, and expectations that aim on adolescents like bullets and protect the values and ideals of today’s adolescents. "
## [3] "In Japan, they are known as Bōdan Shōnendan (防弾少年団), which translates similarly. "
## [4] "On July 2017, BTS announced that in addition to being known as Bangtan Sonyeondan or Bulletproof Boy Scouts, the acronym would also stand for 'Beyond The Scene' as part of their new brand identity. "
## [5] "This extended their name to mean 'growing youth BTS who is going beyond the realities they are facing, and going forward."
# data_frame() is deprecated; tibble() is its replacement
bts_sents_df <- tibble(line = seq_along(bts_sents), text = bts_sents)
bts_sents_df
## # A tibble: 5 x 2
## line text
## <int> <chr>
## 1 1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2 2 "The name was conceptualized with the thought that BTS would bloc…
## 3 3 "In Japan, they are known as Bōdan Shōnendan (防弾少年団), which trans…
## 4 4 "On July 2017, BTS announced that in addition to being known as B…
## 5 5 This extended their name to mean 'growing youth BTS who is going …
?unnest_tokens
bts_sent_tidy <- bts_sents_df %>%
unnest_tokens(word, text)
bts_sent_tidy
## # A tibble: 124 x 2
## line word
## <int> <chr>
## 1 1 the
## 2 1 group's
## 3 1 name
## 4 1 bts
## 5 1 is
## 6 1 an
## 7 1 acronym
## 8 1 for
## 9 1 the
## 10 1 korean
## # ... with 114 more rows
bts_sent_table <- bts_sent_tidy %>%
count(word, sort=TRUE)
bts_sent_table
## # A tibble: 84 x 2
## word n
## <chr> <int>
## 1 the 8
## 2 and 4
## 3 bts 4
## 4 as 3
## 5 name 3
## 6 that 3
## 7 acronym 2
## 8 adolescents 2
## 9 are 2
## 10 bangtan 2
## # ... with 74 more rows
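count(word, sort = TRUE) is roughly shorthand for grouping by word, counting the rows in each group, and arranging the result by that count in descending order. A sketch on a made-up tibble (toy_tokens is hypothetical):

```r
library(dplyr)

# Made-up tokens, one per row, in the shape unnest_tokens() produces
toy_tokens <- tibble(word = c("bts", "the", "bts", "name", "the", "bts"))

# Short form
short <- toy_tokens %>%
  count(word, sort = TRUE)

# Spelled-out form
long <- toy_tokens %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

all.equal(short, long)
```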
Using pipe operators, the whole process above can be written as a single chain:
bts_sent_table <- bts_sent %>%
str_extract_all(boundary("sentence")) %>%
unlist() %>%
# the dot (.) stands for the sentence vector coming down the pipe
{ tibble(line = seq_along(.), text = .) } %>%
unnest_tokens(word, text) %>%
count(word, sort=TRUE)
bts_sent_table
## # A tibble: 84 x 2
## word n
## <chr> <int>
## 1 the 8
## 2 and 4
## 3 bts 4
## 4 as 3
## 5 name 3
## 6 that 3
## 7 acronym 2
## 8 adolescents 2
## 9 are 2
## 10 bangtan 2
## # ... with 74 more rows
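By default the piped value becomes the first argument of the next call. When it has to go somewhere else, or is needed more than once, magrittr lets us refer to it as a dot (.), and wrapping the call in braces prevents it from also being inserted as the first argument. A small sketch with a made-up character vector (sents is hypothetical):

```r
library(dplyr)

# Made-up sentences, for illustration only
sents <- c("First sentence.", "Second one.", "And a third.")

sents_df <- sents %>%
  { tibble(line = seq_along(.), text = .) }   # . is the piped-in vector

sents_df
```

Building the line numbers from the dot itself means the chain no longer depends on how many sentences there are or on an object created earlier.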
What about tokenizing tweets about #bts?
tweets_bts_table <- tweets_bts %>%
filter(user_lang == "en") %>%
unnest_tokens(word, text) %>%
count(word, sort=TRUE)
tweets_bts_table
## # A tibble: 18,556 x 2
## word n
## <chr> <int>
## 1 t.co 31026
## 2 https 30943
## 3 bts 27634
## 4 rt 26051
## 5 방탄소년단 11204
## 6 love_yourself 9035
## 7 轉 8796
## 8 tear 8509
## 9 concept 7859
## 10 photo 7738
## # ... with 18,546 more rows
What preprocessing tasks do we need to do?
tweets_bts_tidy <- tweets_bts %>%
filter(user_lang == "en") %>%
mutate(text = str_replace_all(text, "[^[:ascii:]]", "")) %>%
unnest_tokens(word, text) %>%
count(word, sort=TRUE)
tweets_bts_tidy[1:20,]
## # A tibble: 20 x 2
## word n
## <chr> <int>
## 1 t.co 31026
## 2 https 30930
## 3 bts 27619
## 4 rt 25889
## 5 love_yourself 9027
## 6 tear 8386
## 7 concept 7859
## 8 photo 7738
## 9 version 7611
## 10 bts_twt 7603
## 11 bighitent 7541
## 12 r 3959
## 13 to 3941
## 14 for 3815
## 15 o 3761
## 16 comeback 3644
## 17 the 3576
## 18 of 3305
## 19 show 3259
## 20 bts_bighit 3194
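Even after removing non-ASCII characters, the counts are still dominated by URL fragments (t.co, https) and the retweet marker rt. One possible further step, sketched here on two made-up tweets, is to strip URLs and a leading "RT" before tokenizing. The toy_tweets data and both regular expressions are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two made-up tweets, for illustration only
toy_tweets <- tibble(
  text = c("RT @someone: BTS comeback photo https://t.co/abc123",
           "BTS concept photo version https://t.co/xyz789")
)

toy_counts <- toy_tweets %>%
  mutate(text = str_replace_all(text, "https?://\\S+", "")) %>%  # drop URLs
  mutate(text = str_replace_all(text, "^RT\\s+", "")) %>%        # drop leading RT
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

toy_counts
```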
bts_sents_df
## # A tibble: 5 x 2
## line text
## <int> <chr>
## 1 1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2 2 "The name was conceptualized with the thought that BTS would bloc…
## 3 3 "In Japan, they are known as Bōdan Shōnendan (防弾少年団), which trans…
## 4 4 "On July 2017, BTS announced that in addition to being known as B…
## 5 5 This extended their name to mean 'growing youth BTS who is going …
bts_sents_df %>%
filter(str_detect(text, "BTS"))
## # A tibble: 4 x 2
## line text
## <int> <chr>
## 1 1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2 2 "The name was conceptualized with the thought that BTS would bloc…
## 3 4 "On July 2017, BTS announced that in addition to being known as B…
## 4 5 This extended their name to mean 'growing youth BTS who is going …
bts_sents_df %>%
mutate(text=str_replace_all(text, "[^[:ascii:]]", ""))
## # A tibble: 5 x 2
## line text
## <int> <chr>
## 1 1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2 2 "The name was conceptualized with the thought that BTS would bloc…
## 3 3 "In Japan, they are known as Bdan Shnendan (), which translates s…
## 4 4 "On July 2017, BTS announced that in addition to being known as B…
## 5 5 This extended their name to mean 'growing youth BTS who is going …
bts_sents_df %>%
unnest_tokens(word, text)
## # A tibble: 124 x 2
## line word
## <int> <chr>
## 1 1 the
## 2 1 group's
## 3 1 name
## 4 1 bts
## 5 1 is
## 6 1 an
## 7 1 acronym
## 8 1 for
## 9 1 the
## 10 1 korean
## # ... with 114 more rows
bts_sents_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
## # A tibble: 62 x 2
## line word
## <int> <chr>
## 1 1 group's
## 2 1 bts
## 3 1 acronym
## 4 1 korean
## 5 1 expression
## 6 1 bangtan
## 7 1 sonyeondan
## 8 1 hangul
## 9 1 방탄소년단
## 10 1 hanja
## # ... with 52 more rows
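anti_join() keeps only the rows of the left table that have no match in the right table, which is why it removes stop words here. A sketch with two made-up tibbles (toy_tokens and toy_stop are hypothetical; the real stop_words table ships with tidytext):

```r
library(dplyr)

# Made-up tokens and a made-up two-word stop list
toy_tokens <- tibble(line = c(1, 1, 1, 2),
                     word = c("the", "group's", "name", "bts"))
toy_stop   <- tibble(word = c("the", "name"))

# Naming the key with by = "word" also silences the "Joining, by" message
kept <- toy_tokens %>%
  anti_join(toy_stop, by = "word")

kept$word
```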
bts_sents_df %>%
unnest_tokens(word, text) %>%
select(word)
## # A tibble: 124 x 1
## word
## <chr>
## 1 the
## 2 group's
## 3 name
## 4 bts
## 5 is
## 6 an
## 7 acronym
## 8 for
## 9 the
## 10 korean
## # ... with 114 more rows
bts_sents_df %>%
unnest_tokens(word, text) %>%
select(word) %>%
count(word, sort=TRUE)
## # A tibble: 84 x 2
## word n
## <chr> <int>
## 1 the 8
## 2 and 4
## 3 bts 4
## 4 as 3
## 5 name 3
## 6 that 3
## 7 acronym 2
## 8 adolescents 2
## 9 are 2
## 10 bangtan 2
## # ... with 74 more rows
bts_sents_df %>%
unnest_tokens(word, text) %>%
select(word) %>%
count(word) %>%
arrange(n)
## # A tibble: 84 x 2
## word n
## <chr> <int>
## 1 2017 1
## 2 addition 1
## 3 aim 1
## 4 also 1
## 5 an 1
## 6 announced 1
## 7 being 1
## 8 block 1
## 9 bōdan 1
## 10 brand 1
## # ... with 74 more rows
bts_sents_df %>%
unnest_tokens(word, text) %>%
select(word) %>%
count(word, sort=TRUE) %>%
top_n(5, n)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 the 8
## 2 and 4
## 3 bts 4
## 4 as 3
## 5 name 3
## 6 that 3
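Note that top_n(5, n) returned six rows: as, name, and that are tied at 3, and ties are kept. In dplyr 1.0.0 and later (an assumption about your installation), slice_max() is the suggested replacement and makes the tie handling explicit. A sketch on made-up counts that mirror the table above:

```r
library(dplyr)

# Made-up counts mirroring the top of the frequency table
toy_counts <- tibble(word = c("the", "and", "bts", "as", "name", "that"),
                     n    = c(8L, 4L, 4L, 3L, 3L, 3L))

# Default: ties at the cut-off are kept, so more than 5 rows can come back
with_ties <- toy_counts %>% slice_max(n, n = 5)

# with_ties = FALSE returns exactly 5 rows
exactly_5 <- toy_counts %>% slice_max(n, n = 5, with_ties = FALSE)

nrow(with_ties)
nrow(exactly_5)
```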
Work together to preprocess and tokenize tweets on a topic that interests you, then find the five most frequent words.