Hello everyone! This time I am going to do an exercise with text. This is a simple and fun task. I would like to create a wordcloud from tweets about the latest film in the Avengers series, Avengers: Endgame. The objective is to find the words people tweeted most often about the film. The tweet data is obtained from https://www.kaggle.com/kavita5/twitter-dataset-avengersendgame/metadata.
Happy reading!
As usual, I will load all of the needed libraries. I use tidytext and textclean for text cleaning, while wordcloud is used for creating the wordcloud.
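For reference, here is a minimal sketch of the setup. tidytext, textclean, and wordcloud are the packages mentioned above; the others (readr, dplyr, stringr, tibble, RColorBrewer) are my assumptions, inferred from the functions used later.

library(readr)        # read_csv() for loading the CSV (assumed)
library(dplyr)        # the %>% pipe and count() (assumed)
library(stringr)      # str_to_lower(), str_remove_all(), str_replace_all() (assumed)
library(tibble)       # enframe() (assumed)
library(tidytext)     # unnest_tokens() and the stop_words data
library(textclean)    # the replace_*() cleaning functions and strip()
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal() colour palettes (assumed)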
Load the data
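Here is a sketch of the loading step, assuming the Kaggle CSV has been downloaded and saved locally; the file name tweets.csv and the object name df are placeholders I chose.

df <- read_csv("tweets.csv") # file name is an assumption; point this at your downloaded file
head(df)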
## # A tibble: 6 x 17
## X1 text favorited favoriteCount replyToSN created truncated
## <dbl> <chr> <lgl> <dbl> <chr> <dttm> <lgl>
## 1 1 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:30 FALSE
## 2 2 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:30 FALSE
## 3 3 "sav~ FALSE 0 <NA> 2019-04-23 10:43:30 FALSE
## 4 4 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:29 FALSE
## 5 5 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:29 FALSE
## 6 6 "RT ~ FALSE 0 <NA> 2019-04-23 10:43:29 FALSE
## # ... with 10 more variables: replyToSID <dbl>, id <dbl>, replyToUID <dbl>,
## # statusSource <chr>, screenName <chr>, retweetCount <dbl>, isRetweet <lgl>,
## # retweeted <lgl>, longitude <dbl>, latitude <dbl>
I will subset the tweet text column only (as a character vector).
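A sketch of that step, using the df object assumed above and the text column shown in the tibble:

tweets <- df$text # keep only the tweet text as a character vector
head(tweets, 5)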
## [1] "RT @mrvelstan: literally nobody:\r\nme:\r\n\r\n#AvengersEndgame https://t.co/LR9kFwfD5c"
## [2] "RT @agntecarter: i<U+0092>m emotional, sorry!!\r\n\r\n2014 x 2019\r\n#blackwidow\r\n#captainamerica https://t.co/xcwkCMw18w"
## [3] "saving these bingo cards for tomorrow \r\n<U+00A9>\r\n #AvengersEndgame https://t.co/d6For0jwRb"
## [4] "RT @HelloBoon: Man these #AvengersEndgame ads are everywhere https://t.co/Q0lNf5eJsX"
## [5] "RT @Marvel: We salute you, @ChrisEvans! #CaptainAmerica #AvengersEndgame https://t.co/VlPEpnXYgm"
Next, I am going to clean the tweets in order to get the most meaningful words. I create an object called unused to list the words I would like to remove.
unused <- c("vouchers|world|win|bundle|merch|pay|chance|thanos|follow|inspi|thor|paytm|guys|gif|studios|marvel|prais|black|amp|ads|evans|hemswoh|avenger|avengers|chris|scarlet|johansson|brie|
evans|larson|captain|captains|
hemswoh|widow|red|iron")
clean <- tweets %>%
  str_to_lower() %>%            #lowercase all letters
  str_remove_all("rt") %>%      #remove "rt" since it marks retweets
  replace_tag() %>%             #remove mentions (@username)
  replace_hash() %>%            #remove hashtags
  replace_url() %>%             #remove URLs
  replace_contraction() %>%     #expand contracted words
  replace_word_elongation() %>% #normalize elongated words (e.g. "sooo" becomes "so")
  replace_symbol() %>%          #replace symbols with word equivalents
  replace_emoji() %>%           #replace emojis
  replace_html() %>%            #remove HTML codes and characters
  replace_internet_slang() %>%  #replace internet slang with standard words
  str_replace_all(pattern = unused, "") %>% #remove the words listed in the unused object above
  strip()                       #strip remaining non-letter characters
head(clean)
## [1] "literally nobody me"
## [2] "i m emotional sorry x"
## [3] "saving these bingo cards for tomorrow"
## [4] "man these are everywhere"
## [5] "we salute you"
## [6] "the first nonspoiler critic reactions are here and nearly all are exceptionally positive with many"
After cleaning the text, I am going to convert the vector above into a new dataframe and separate each word into its own row (tokenize).
word <- enframe(clean, value = "tweet", name = NULL) %>% #turn the vector into a one-column tibble
  unnest_tokens(tweet, tweet) %>%                        #tokenize: one word per row
  count(tweet, sort = TRUE) %>%                          #count word frequencies
  anti_join(stop_words, by = c("tweet" = "word"))        #drop common stop words
word
## # A tibble: 3,293 x 2
## tweet n
## <chr> <int>
## 1 premiere 1611
## 2 cried 896
## 3 salute 823
## 4 carpet 677
## 5 movie 625
## 6 times 603
## 7 exclusive 521
## 8 ready 469
## 9 pop 411
## 10 reactions 334
## # ... with 3,283 more rows
Let’s create the wordcloud!
word %>%
  with(
    wordcloud(
      words = tweet,                  #the words to plot
      freq = n,                       #their frequencies
      max.words = 65,                 #plot at most 65 words
      random.order = FALSE,           #place the most frequent words in the center
      colors = brewer.pal(8, "Set2"), #colour palette
      scale = c(5, 0.4)               #size range from the largest to the smallest word
    )
  )
Great! The wordcloud has been built. The most frequently tweeted words appear in a larger font, while less frequent words appear in smaller fonts.
So, that’s all for the process of creating a wordcloud using packages in the R programming language. I hope this page helps you understand this kind of text problem and the solution behind it.
See you on the next page!
Author,
Alfado Sembiring
Notes:
In case you want to look up my profile, click the link below:
Jump To My Profile