Hello everyone! This time I am going to do an exercise with text. This is a simple and fun task. I would like to create a wordcloud from tweets about the latest film in the Avengers series, Avengers: Endgame. The objective is to find the words people tweeted most often about the film. The tweet data is obtained from https://www.kaggle.com/kavita5/twitter-dataset-avengersendgame/metadata.
Happy reading!
As usual, I will load all of the needed libraries. I use tidytext and textclean for text cleaning, while wordcloud is used for creating the wordcloud.
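For reference, here is a minimal sketch of the setup. tidytext, textclean, and wordcloud are the packages mentioned above; the others (readr, dplyr, stringr, tibble, RColorBrewer) are my assumptions, inferred from the functions used later.

library(readr)        # read_csv() for loading the CSV (assumed)
library(dplyr)        # the %>% pipe and count() (assumed)
library(stringr)      # str_to_lower(), str_remove_all(), str_replace_all() (assumed)
library(tibble)       # enframe() (assumed)
library(tidytext)     # unnest_tokens() and the stop_words data
library(textclean)    # the replace_*() cleaning functions and strip()
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal() colour palettes (assumed)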
Load the data
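Here is a sketch of the loading step, assuming the Kaggle CSV has been downloaded and saved locally; the file name tweets.csv and the object name df are placeholders I chose.

df <- read_csv("tweets.csv") # file name is an assumption; point this at your downloaded file
head(df)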
## # A tibble: 6 x 17
## X1 text favorited favoriteCount replyToSN created truncated
## <dbl> <chr> <lgl> <dbl> <chr> <dttm> <lgl>
## 1 1 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:30 FALSE
## 2 2 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:30 FALSE
## 3 3 "sav~ FALSE 0 <NA> 2019-04-23 10:43:30 FALSE
## 4 4 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:29 FALSE
## 5 5 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:29 FALSE
## 6 6 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:29 FALSE
## # ... with 10 more variables: replyToSID <dbl>, id <dbl>, replyToUID <dbl>,
## # statusSource <chr>, screenName <chr>, retweetCount <dbl>, isRetweet <lgl>,
## # retweeted <lgl>, longitude <dbl>, latitude <dbl>
I will subset the tweet text column only (as a character vector).
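A sketch of that step, using the df object assumed above and the text column shown in the tibble:

tweets <- df$text # keep only the tweet text as a character vector
head(tweets, 5)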
## [1] "RT @mrvelstan: literally nobody:\r\nme:\r\n\r\n#AvengersEndgame https://t.co/LR9kFwfD5c"
## [2] "RT @agntecarter: i<U+0092>m emotional, sorry!!\r\n\r\n2014 x 2019\r\n#blackwidow\r\n#captainamerica https://t.co/xcwkCMw18w"
## [3] "saving these bingo cards for tomorrow \r\n<U+00A9>\r\n #AvengersEndgame https://t.co/d6For0jwRb"
## [4] "RT @HelloBoon: Man these #AvengersEndgame ads are everywhere https://t.co/Q0lNf5eJsX"
## [5] "RT @Marvel: We salute you, @ChrisEvans! #CaptainAmerica #AvengersEndgame https://t.co/VlPEpnXYgm"
Next, I am going to clean the tweets in order to get the most meaningful words. I create an object called unused to list the words I would like to remove.
unused <- c("vouchers|world|win|bundle|merch|pay|chance|thanos|follow|inspi|thor|paytm|guys|gif|studios|marvel|prais|black|amp|ads|evans|hemswoh|avenger|avengers|chris|scarlet|johansson|brie|
evans|larson|captain|captains|
hemswoh|widow|red|iron")
clean <- tweets %>%
  str_to_lower() %>%            #lowercase all letters
  str_remove_all("rt") %>%      #remove "rt" since it marks retweets
  replace_tag() %>%             #remove mentions (@username)
  replace_hash() %>%            #remove hashtags
  replace_url() %>%             #remove URLs
  replace_contraction() %>%     #expand contracted words
  replace_word_elongation() %>% #normalize elongated words (e.g. "sooo" becomes "so")
  replace_symbol() %>%          #replace symbols with word equivalents
  replace_emoji() %>%           #replace emojis
  replace_html() %>%            #remove HTML codes and characters
  replace_internet_slang() %>%  #replace internet slang with standard words
  str_replace_all(pattern = unused, "") %>% #remove the words listed in the unused object above
  strip()                       #strip remaining non-letter characters
head(clean)
## [1] "literally nobody me"
## [2] "i m emotional sorry x"
## [3] "saving these bingo cards for tomorrow"
## [4] "man these are everywhere"
## [5] "we salute you"
## [6] "the first nonspoiler critic reactions are here and nearly all are exceptionally positive with many"
After cleaning the text, I am going to convert the vector above into a new dataframe and separate each word into its own row (tokenize).
word <- enframe(clean, value = "tweet", name = NULL) %>% #turn the vector into a one-column tibble
  unnest_tokens(tweet, tweet) %>%                        #tokenize: one word per row
  count(tweet, sort = TRUE) %>%                          #count word frequencies
  anti_join(stop_words, by = c("tweet" = "word"))        #drop common stop words
word
## # A tibble: 3,293 x 2
## tweet n
## <chr> <int>
## 1 premiere 1611
## 2 cried 896
## 3 salute 823
## 4 carpet 677
## 5 movie 625
## 6 times 603
## 7 exclusive 521
## 8 ready 469
## 9 pop 411
## 10 reactions 334
## # ... with 3,283 more rows
Let’s create the wordcloud!
word %>%
  with(
    wordcloud(
      words = tweet,                  #the words to plot
      freq = n,                       #their frequencies
      max.words = 65,                 #plot at most 65 words
      random.order = FALSE,           #place the most frequent words in the center
      colors = brewer.pal(8, "Set2"), #colour palette
      scale = c(5, 0.4)               #size range from the largest to the smallest word
    )
  )
Great! The wordcloud has been built. The most frequently tweeted words appear in a larger font, while less frequent words appear in smaller fonts.
So, that’s all for the process of creating a wordcloud using packages in the R programming language. I hope this page helps you understand this kind of text problem and the solution behind it.
See you on the next page!
Author,
Alfado Sembiring
Notes:
In case you want to look up my profile, click the link below:
Jump To My Profile