Mining the Social Web

The second part of this course covers social media as a source of large-scale text data that we can gather and analyze for text mining. Social media are online platforms where people connect and interact with each other to share information and communicate. We start with Twitter.

To begin with, we need to know what Twitter is. Here’s an introduction: https://www.youtube.com/watch?v=YoPcJ5eKFA4

From now on, we will learn how to download text data from Twitter and analyze tweets about a topic that interests you. To do so, we first need to know how to mine tweets with TAGS.

We will walk through the process of setting up a TAGS archive, linking it to R, and mining it with TidyText.

1. Visit https://tags.hawksey.info/ to set up TAGS (Twitter Archiving Google Sheet).
2. Make a copy of the TAGS sheet in your Google account (your Gmail account) by selecting TAGS v6.1, which is the easiest version to set up.
3. Set up Twitter access under the TAGS tab (for this step, you also need a Twitter account).
4. Now we are ready to mine tweets that include any hashtag you are interested in.
5. What is a hashtag? See https://en.wikipedia.org/wiki/Hashtag
6. Once you’ve mined tweets, the archived Google Sheet can be imported into RStudio as follows:

library(httr)
library(readr)
google_sheet_csv <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vT0PXh3JBXch3lxbxRqJYNvL-FYkAVg1SFv1oXJttNz51BlLKkS94o9lH60Vdma4bHRzcpLPMHbXzuy/pub?gid=400689247&single=true&output=csv"
tweets_bts <- read_csv(google_sheet_csv)

## Parsed with column specification:
## cols(
##   id_str = col_double(),
##   from_user = col_character(),
##   text = col_character(),
##   created_at = col_character(),
##   time = col_character(),
##   geo_coordinates = col_character(),
##   user_lang = col_character(),
##   in_reply_to_user_id_str = col_double(),
##   in_reply_to_screen_name = col_character(),
##   from_user_id_str = col_double(),
##   in_reply_to_status_id_str = col_double(),
##   source = col_character(),
##   profile_image_url = col_character(),
##   user_followers_count = col_integer(),
##   user_friends_count = col_integer(),
##   user_location = col_character(),
##   status_url = col_character(),
##   entities_str = col_character()
## )
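One detail in the column specification above is worth a second look: id_str and the other ID columns are parsed as doubles. A double can represent integers exactly only up to 2^53, and tweet IDs are typically longer than that, so distinct IDs can be rounded to the same number. The following is a minimal sketch of one way to avoid this, assuming we read the ID column as text. The file and its rows are hypothetical stand-ins for the published sheet, not real tweets.

```r
library(readr)

# A tiny stand-in for the published sheet (hypothetical rows, not real tweets)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id_str,from_user,text",
             "993333333333333333,user_a,hello #bts",
             "993333333333333334,user_b,another tweet"),
           csv_path)

# Read the ID column as text so the long numeric IDs are not rounded;
# all other columns are still guessed as before
tweets_demo <- read_csv(
  csv_path,
  col_types = cols(id_str = col_character(), .default = col_guess())
)

# The two IDs differ only in the last digit and stay distinct as strings
n_distinct_ids <- length(unique(tweets_demo$id_str))
n_distinct_ids
```

The same col_types argument could be passed when reading google_sheet_csv, listing each *_id_str column that should stay as text.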

dplyr package and Pipes in R

The pipe operator %>% is very useful in R.

Background of the Pipe Operator in R

Let’s say we have two functions, f: A -> B and g: B -> C. We can chain these two functions together by taking the output of one function and passing it into the next. In short, “chaining” means that we pass an intermediate result on to the next function. Here, “g follows f”: g(f(x))

In R, we can pass a result from one command to the next with the pipe operator. As we’ve seen, R code often contains many parentheses, ( and ), especially when the code is complex: functions are nested inside other functions, which are nested inside yet other functions, and so on. This makes R code hard to read and understand. This is where %>% comes to the rescue.
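As a small warm-up, the “g follows f” idea can be written both ways. This is only a sketch: f and g are hypothetical helper functions named to mirror the notation above, and %>% is loaded here from magrittr (dplyr re-exports the same operator).

```r
library(magrittr)  # provides %>%

# Hypothetical helper functions, named only to mirror the notation above
f <- function(x) x + 1   # plays the role of f: A -> B
g <- function(x) x * 2   # plays the role of g: B -> C

nested <- g(f(3))             # nested call: read from the inside out
piped  <- 3 %>% f() %>% g()   # piped call: read from left to right

# Both forms compute the same value
identical(nested, piped)
```

The piped form lists the steps in the order they run, which is what makes longer chains easier to follow.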

Here’s an example:

library(stringr)

bts_sent <- "The group's name, BTS, is an acronym for the Korean expression Bangtan Sonyeondan (Hangul: 방탄소년단; Hanja: 防彈少年團), literally meaning 'Bulletproof Boy Scouts'. The name was conceptualized with the thought that BTS would block out stereotypes, criticisms, and expectations that aim on adolescents like bullets and protect the values and ideals of today’s adolescents. In Japan, they are known as Bōdan Shōnendan (防弾少年団), which translates similarly. On July 2017, BTS announced that in addition to being known as Bangtan Sonyeondan or Bulletproof Boy Scouts, the acronym would also stand for 'Beyond The Scene' as part of their new brand identity. This extended their name to mean 'growing youth BTS who is going beyond the realities they are facing, and going forward."

# To tokenize the text and count word frequencies in descending order, we used the following code:
sort(table(unlist(str_extract_all(tolower(bts_sent), boundary("word")))), decreasing = TRUE)

## 
##            the            and            bts             as           name 
##              8              4              4              3              3 
##           that        acronym    adolescents            are        bangtan 
##              3              2              2              2              2 
##         beyond            boy    bulletproof            for          going 
##              2              2              2              2              2 
##             in             is          known             of             on 
##              2              2              2              2              2 
##         scouts     sonyeondan          their           they             to 
##              2              2              2              2              2 
##          would           少年           2017       addition            aim 
##              2              2              1              1              1 
##           also             an      announced          being          block 
##              1              1              1              1              1 
##          bōdan          brand        bullets conceptualized     criticisms 
##              1              1              1              1              1 
##   expectations     expression       extended         facing        forward 
##              1              1              1              1              1 
##        group's        growing         hangul          hanja         ideals 
##              1              1              1              1              1 
##       identity          japan           july         korean           like 
##              1              1              1              1              1 
##      literally           mean        meaning            new             or 
##              1              1              1              1              1 
##            out           part        protect      realities          scene 
##              1              1              1              1              1 
##      shōnendan      similarly          stand    stereotypes           this 
##              1              1              1              1              1 
##        thought        today’s     translates         values            was 
##              1              1              1              1              1 
##          which            who           with          youth     방탄소년단 
##              1              1              1              1              1 
##             団             團           防弾           防彈 
##              1              1              1              1

But with the help of %>%, we can rewrite the above code as follows:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

bts_table <- bts_sent %>% 
  str_to_lower() %>% 
  str_extract_all(boundary("word")) %>% 
  unlist() %>% 
  table() %>% 
  sort(decreasing = TRUE)
bts_table

## .
##            the            and            bts             as           name 
##              8              4              4              3              3 
##           that        acronym    adolescents            are        bangtan 
##              3              2              2              2              2 
##         beyond            boy    bulletproof            for          going 
##              2              2              2              2              2 
##             in             is          known             of             on 
##              2              2              2              2              2 
##         scouts     sonyeondan          their           they             to 
##              2              2              2              2              2 
##          would           少年           2017       addition            aim 
##              2              2              1              1              1 
##           also             an      announced          being          block 
##              1              1              1              1              1 
##          bōdan          brand        bullets conceptualized     criticisms 
##              1              1              1              1              1 
##   expectations     expression       extended         facing        forward 
##              1              1              1              1              1 
##        group's        growing         hangul          hanja         ideals 
##              1              1              1              1              1 
##       identity          japan           july         korean           like 
##              1              1              1              1              1 
##      literally           mean        meaning            new             or 
##              1              1              1              1              1 
##            out           part        protect      realities          scene 
##              1              1              1              1              1 
##      shōnendan      similarly          stand    stereotypes           this 
##              1              1              1              1              1 
##        thought        today’s     translates         values            was 
##              1              1              1              1              1 
##          which            who           with          youth     방탄소년단 
##              1              1              1              1              1 
##             団             團           防弾           防彈 
##              1              1              1              1

Ready for text-preprocessing?

The dplyr and tidytext packages make tokenization much easier by means of pipe operators.

The only function we need for tokenization is unnest_tokens():

library(tidytext)
bts_sents <- bts_sent %>% 
  str_extract_all(boundary("sentence")) %>% 
  unlist() 
bts_sents

## [1] "The group's name, BTS, is an acronym for the Korean expression Bangtan Sonyeondan (Hangul: 방탄소년단; Hanja: 防彈少年團), literally meaning 'Bulletproof Boy Scouts'. "                                        
## [2] "The name was conceptualized with the thought that BTS would block out stereotypes, criticisms, and expectations that aim on adolescents like bullets and protect the values and ideals of today’s adolescents. "
## [3] "In Japan, they are known as Bōdan Shōnendan (防弾少年団), which translates similarly. "                                                                                                                         
## [4] "On July 2017, BTS announced that in addition to being known as Bangtan Sonyeondan or Bulletproof Boy Scouts, the acronym would also stand for 'Beyond The Scene' as part of their new brand identity. "         
## [5] "This extended their name to mean 'growing youth BTS who is going beyond the realities they are facing, and going forward."

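# tibble() replaces the deprecated data_frame(); seq_along() avoids hard-coding the number of sentences
bts_sents_df <- tibble(line = seq_along(bts_sents), text = bts_sents)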
bts_sents_df

## # A tibble: 5 x 2
##    line text                                                              
##   <int> <chr>                                                             
## 1     1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2     2 "The name was conceptualized with the thought that BTS would bloc…
## 3     3 "In Japan, they are known as Bōdan Shōnendan (防弾少年団), which trans…
## 4     4 "On July 2017, BTS announced that in addition to being known as B…
## 5     5 This extended their name to mean 'growing youth BTS who is going …

?unnest_tokens

bts_sent_tidy <- bts_sents_df %>% 
  unnest_tokens(word, text)
bts_sent_tidy

## # A tibble: 124 x 2
##     line word   
##    <int> <chr>  
##  1     1 the    
##  2     1 group's
##  3     1 name   
##  4     1 bts    
##  5     1 is     
##  6     1 an     
##  7     1 acronym
##  8     1 for    
##  9     1 the    
## 10     1 korean 
## # ... with 114 more rows
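By default, unnest_tokens() lowercases the text and splits it into single words, as the output above shows. Its help page lists arguments that change this behavior. Here is a minimal sketch of two of them, to_lower and token, using a one-sentence toy tibble that is hypothetical and not part of the course data.

```r
library(dplyr)
library(tidytext)

# A one-row toy tibble (hypothetical example text)
toy_df <- tibble(line = 1L, text = "BTS announced a new brand identity")

# to_lower = FALSE keeps the original capitalization of each token
toy_words <- toy_df %>%
  unnest_tokens(word, text, to_lower = FALSE)

# token = "ngrams" with n = 2 returns overlapping two-word sequences
toy_bigrams <- toy_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_words
toy_bigrams
```

Keeping the case can matter for names such as BTS, while bigrams keep some of the word order that single-word tokens discard.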

bts_sent_table <- bts_sent_tidy %>% 
  count(word, sort=TRUE)
bts_sent_table

## # A tibble: 84 x 2
##    word            n
##    <chr>       <int>
##  1 the             8
##  2 and             4
##  3 bts             4
##  4 as              3
##  5 name            3
##  6 that            3
##  7 acronym         2
##  8 adolescents     2
##  9 are             2
## 10 bangtan         2
## # ... with 74 more rows

Using pipe operators, the above process can be done in chains as follows:

bts_sent_table <- bts_sent %>% 
  str_extract_all(boundary("sentence")) %>% 
  unlist() %>% 
  tibble(line = seq_along(.), text = .) %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort=TRUE)
bts_sent_table

## # A tibble: 84 x 2
##    word            n
##    <chr>       <int>
##  1 the             8
##  2 and             4
##  3 bts             4
##  4 as              3
##  5 name            3
##  6 that            3
##  7 acronym         2
##  8 adolescents     2
##  9 are             2
## 10 bangtan         2
## # ... with 74 more rows

What about tokenizing tweets about #bts?

tweets_bts_table <- tweets_bts %>% 
  filter(user_lang == "en") %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort=TRUE)
tweets_bts_table

## # A tibble: 18,556 x 2
##    word              n
##    <chr>         <int>
##  1 t.co          31026
##  2 https         30943
##  3 bts           27634
##  4 rt            26051
##  5 방탄소년단    11204
##  6 love_yourself  9035
##  7 轉             8796
##  8 tear           8509
##  9 concept        7859
## 10 photo          7738
## # ... with 18,546 more rows

What preprocessing tasks do we need to do?

tweets_bts_tidy <- tweets_bts %>%
  filter(user_lang == "en") %>% 
  mutate(text = str_replace_all(text, "[^[:ascii:]]", "")) %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort=TRUE)
tweets_bts_tidy[1:20,]

## # A tibble: 20 x 2
##    word              n
##    <chr>         <int>
##  1 t.co          31026
##  2 https         30930
##  3 bts           27619
##  4 rt            25889
##  5 love_yourself  9027
##  6 tear           8386
##  7 concept        7859
##  8 photo          7738
##  9 version        7611
## 10 bts_twt        7603
## 11 bighitent      7541
## 12 r              3959
## 13 to             3941
## 14 for            3815
## 15 o              3761
## 16 comeback       3644
## 17 the            3576
## 18 of             3305
## 19 show           3259
## 20 bts_bighit     3194
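The table above still starts with t.co, https, and rt, which come from links and retweet markers rather than from what people wrote. One possible next step, sketched here under the assumption that we want to drop links, the retweet marker, and common English stop words, is shown below. The two tweets in toy_tweets are hypothetical stand-ins for tweets_bts.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-in tweets; the real data lives in tweets_bts
toy_tweets <- tibble(
  text = c("RT @bts_twt: LOVE_YOURSELF Tear concept photo https://t.co/abc123",
           "New concept photo is out https://t.co/xyz789")
)

toy_counts <- toy_tweets %>%
  mutate(text = str_remove_all(text, "https?://\\S+")) %>%  # drop links before tokenizing
  unnest_tokens(word, text) %>%
  filter(word != "rt") %>%                                   # drop the retweet marker
  anti_join(stop_words, by = "word") %>%                     # drop common English words
  count(word, sort = TRUE)

toy_counts
```

Removing the links before tokenizing matters: once a URL has been split into pieces such as https and t.co, the fragments are harder to recognize and filter.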

Let’s practice with dplyr

bts_sents_df

## # A tibble: 5 x 2
##    line text                                                              
##   <int> <chr>                                                             
## 1     1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2     2 "The name was conceptualized with the thought that BTS would bloc…
## 3     3 "In Japan, they are known as Bōdan Shōnendan (防弾少年団), which trans…
## 4     4 "On July 2017, BTS announced that in addition to being known as B…
## 5     5 This extended their name to mean 'growing youth BTS who is going …

Keeping only the sentences that contain “BTS”

bts_sents_df %>% 
  filter(str_detect(text, "(BTS)"))

## # A tibble: 4 x 2
##    line text                                                              
##   <int> <chr>                                                             
## 1     1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2     2 "The name was conceptualized with the thought that BTS would bloc…
## 3     4 "On July 2017, BTS announced that in addition to being known as B…
## 4     5 This extended their name to mean 'growing youth BTS who is going …

Removing non-ASCII characters

bts_sents_df %>% 
  mutate(text=str_replace_all(text, "[^[:ascii:]]", ""))

## # A tibble: 5 x 2
##    line text                                                              
##   <int> <chr>                                                             
## 1     1 "The group's name, BTS, is an acronym for the Korean expression B…
## 2     2 "The name was conceptualized with the thought that BTS would bloc…
## 3     3 "In Japan, they are known as Bdan Shnendan (), which translates s…
## 4     4 "On July 2017, BTS announced that in addition to being known as B…
## 5     5 This extended their name to mean 'growing youth BTS who is going …

Tokenizing sentences into words

bts_sents_df %>% 
  unnest_tokens(word, text)

## # A tibble: 124 x 2
##     line word   
##    <int> <chr>  
##  1     1 the    
##  2     1 group's
##  3     1 name   
##  4     1 bts    
##  5     1 is     
##  6     1 an     
##  7     1 acronym
##  8     1 for    
##  9     1 the    
## 10     1 korean 
## # ... with 114 more rows

Removing stopwords

bts_sents_df %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words)

## Joining, by = "word"

## # A tibble: 62 x 2
##     line word      
##    <int> <chr>     
##  1     1 group's   
##  2     1 bts       
##  3     1 acronym   
##  4     1 korean    
##  5     1 expression
##  6     1 bangtan   
##  7     1 sonyeondan
##  8     1 hangul    
##  9     1 방탄소년단
## 10     1 hanja     
## # ... with 52 more rows

Selecting the “word” column only

bts_sents_df %>% 
  unnest_tokens(word, text) %>% 
  select(word)

## # A tibble: 124 x 1
##    word   
##    <chr>  
##  1 the    
##  2 group's
##  3 name   
##  4 bts    
##  5 is     
##  6 an     
##  7 acronym
##  8 for    
##  9 the    
## 10 korean 
## # ... with 114 more rows

Counting each word

bts_sents_df %>% 
  unnest_tokens(word, text) %>% 
  select(word) %>% 
  count(word, sort=TRUE)

## # A tibble: 84 x 2
##    word            n
##    <chr>       <int>
##  1 the             8
##  2 and             4
##  3 bts             4
##  4 as              3
##  5 name            3
##  6 that            3
##  7 acronym         2
##  8 adolescents     2
##  9 are             2
## 10 bangtan         2
## # ... with 74 more rows

Sorting in ascending order

bts_sents_df %>% 
  unnest_tokens(word, text) %>% 
  select(word) %>% 
  count(word) %>% 
  arrange(n)

## # A tibble: 84 x 2
##    word          n
##    <chr>     <int>
##  1 2017          1
##  2 addition      1
##  3 aim           1
##  4 also          1
##  5 an            1
##  6 announced     1
##  7 being         1
##  8 block         1
##  9 bōdan         1
## 10 brand         1
## # ... with 74 more rows

Selecting the top 5 words by frequency

bts_sents_df %>% 
  unnest_tokens(word, text) %>% 
  select(word) %>% 
  count(word, sort=TRUE) %>% 
  top_n(5, n)

## # A tibble: 6 x 2
##   word      n
##   <chr> <int>
## 1 the       8
## 2 and       4
## 3 bts       4
## 4 as        3
## 5 name      3
## 6 that      3
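Note that the result above has six rows, not five: top_n() keeps ties, and as, name, and that each occur three times. If exactly five rows are needed, one option is slice_max() with with_ties = FALSE. This is a minimal sketch that assumes dplyr 1.0.0 or later, where slice_max() is available; word_counts simply copies the first few counts from the tables above.

```r
library(dplyr)

# Word counts copied from the top of the frequency table above
word_counts <- tibble(
  word = c("the", "and", "bts", "as", "name", "that", "acronym"),
  n    = c(8L, 4L, 4L, 3L, 3L, 3L, 2L)
)

# The first n is the column to order by; n = 5 is the number of rows to keep.
# with_ties = FALSE cuts the result at exactly five rows.
top5 <- word_counts %>%
  slice_max(n, n = 5, with_ties = FALSE)

top5
```

Which of the tied words survives the cut depends on row order, so keeping the ties, as top_n() does, is often the more honest summary.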

Group work for text preprocessing

Work together to preprocess and tokenize tweets on a topic that interests you, and then find the top 5 words by frequency.

Week 10: Social Web Mining with TidyText

Shin Lee

5/8/2018