Introduction

Hello everyone! This time I am going to work on an exercise with text. It is a simple and fun task: I would like to create a wordcloud from tweets about the latest Avengers film, Avengers: Endgame. The objective is to find the words people tweet most often about the film. The tweet data is obtained from https://www.kaggle.com/kavita5/twitter-dataset-avengersendgame/metadata.
Happy reading!

Preparation

As usual, I will load all of the needed libraries. I use the tidytext and textclean packages for text cleaning, and wordcloud for building the wordcloud.
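
The original post does not show the setup chunk; here is a minimal sketch of the libraries the steps below rely on (the exact package list is my assumption):

```r
library(readr)     # read the Kaggle CSV
library(dplyr)     # data manipulation pipelines
library(tidytext)  # tokenizing and the stop_words dataset
library(textclean) # cleaning the raw tweet text
library(wordcloud) # drawing the wordcloud
```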

Load the data
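
The loading chunk is not shown in the original; something like this would produce the preview below (the path data/tweets.csv is a placeholder for wherever you saved the Kaggle file):

```r
tweets <- read_csv("data/tweets.csv")  # placeholder path to the Kaggle CSV
head(tweets)
```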

## # A tibble: 6 x 17
##      X1 text  favorited favoriteCount replyToSN created             truncated
##   <dbl> <chr> <lgl>             <dbl> <chr>     <dttm>              <lgl>    
## 1     1 "RT ~ FALSE                 0 <NA>      2019-04-23 10:43:30 FALSE    
## 2     2 "RT ~ FALSE                 0 <NA>      2019-04-23 10:43:30 FALSE    
## 3     3 "sav~ FALSE                 0 <NA>      2019-04-23 10:43:30 FALSE    
## 4     4 "RT ~ FALSE                 0 <NA>      2019-04-23 10:43:29 FALSE    
## 5     5 "RT ~ FALSE                 0 <NA>      2019-04-23 10:43:29 FALSE    
## 6     6 "RT ~ FALSE                 0 <NA>      2019-04-23 10:43:29 FALSE    
## # ... with 10 more variables: replyToSID <dbl>, id <dbl>, replyToUID <dbl>,
## #   statusSource <chr>, screenName <chr>, retweetCount <dbl>, isRetweet <lgl>,
## #   retweeted <lgl>, longitude <dbl>, latitude <dbl>

Next, I will subset the tweet text column only, keeping it as a character vector.
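
A one-line sketch of that subsetting step (the column name text matches the preview above):

```r
text <- tweets$text  # keep only the raw tweet text as a character vector
head(text, 5)
```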

## [1] "RT @mrvelstan: literally nobody:\r\nme:\r\n\r\n#AvengersEndgame https://t.co/LR9kFwfD5c"                                  
## [2] "RT @agntecarter: i<U+0092>m emotional, sorry!!\r\n\r\n2014 x 2019\r\n#blackwidow\r\n#captainamerica https://t.co/xcwkCMw18w"
## [3] "saving these bingo cards for tomorrow \r\n<U+00A9>\r\n #AvengersEndgame https://t.co/d6For0jwRb"                          
## [4] "RT @HelloBoon: Man these #AvengersEndgame ads are everywhere https://t.co/Q0lNf5eJsX"                                     
## [5] "RT @Marvel: We salute you, @ChrisEvans! #CaptainAmerica #AvengersEndgame https://t.co/VlPEpnXYgm"

Tokenize

After cleaning the text, I am going to convert the vector above into a new data frame and separate each word into its own row (tokenize).
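
The cleaning and tokenizing chunk is not shown in the original; here is a sketch using textclean and tidytext. The specific replace_*() steps and the "rt" filter are my assumptions about how the counts below were produced:

```r
# Cleaning: these particular textclean steps are an assumption
clean_text <- text %>%
  replace_url() %>%        # drop https://t.co/... links
  replace_html() %>%       # strip HTML remnants such as &amp;
  replace_non_ascii() %>%  # remove <U+0092>-style encoding artifacts
  replace_hash(replacement = "") %>%  # drop #hashtags
  replace_tag(replacement = "")       # drop @mentions

# Tokenizing: one word per row, then count word frequencies
word_counts <- tibble(tweet_text = clean_text) %>%
  unnest_tokens(tweet, tweet_text) %>%                 # column named "tweet", as in the output below
  anti_join(stop_words, by = c("tweet" = "word")) %>%  # remove common stop words
  filter(tweet != "rt") %>%                            # drop the retweet marker
  count(tweet, sort = TRUE)

word_counts
```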

## # A tibble: 3,293 x 2
##    tweet         n
##    <chr>     <int>
##  1 premiere   1611
##  2 cried       896
##  3 salute      823
##  4 carpet      677
##  5 movie       625
##  6 times       603
##  7 exclusive   521
##  8 ready       469
##  9 pop         411
## 10 reactions   334
## # ... with 3,283 more rows

WordCloud

Let’s create the wordcloud!
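
The plotting chunk is not shown; here is a sketch of the call, with the parameter values (seed, max.words, palette) as my assumptions:

```r
set.seed(123)  # fix the layout so the cloud is reproducible
wordcloud(words = word_counts$tweet,
          freq = word_counts$n,
          max.words = 100,       # cap the cloud at the 100 most frequent words
          random.order = FALSE,  # place the most frequent words in the centre
          colors = RColorBrewer::brewer.pal(8, "Dark2"))
```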

Great! The wordcloud has been built. The most frequently tweeted words appear in a larger font, while less frequent words appear smaller.

Ending

So, that’s all for the process of building a wordcloud using packages in the R programming language. I hope this page helps you understand text problems and the solutions behind them.

See you on the next page!

Author,
Alfado Sembiring

Notes:
In case you want to look up my profile, click the link below :
Jump To My Profile