Analysis of tweets after Australia vs. France match (2018 World Cup)

Overview

The purpose of this project was to use the ‘twitteR’ library in R to investigate tweets made after the Australia vs. France group stage match in the 2018 World Cup. 500 tweets that used the #Socceroos hashtag were pulled after the conclusion of the game, which ended with Australia losing 2-1. The tweets were then cleaned to remove unnecessary punctuation, emojis, white space and words in order to investigate:

Which tweet was most popular?
What was the overall sentiment?
What would a word cloud look like?

It is hoped that after producing results, the emotions and feelings of Socceroos fans can be understood in a way that is simple and does not require sifting through hundreds of tweets.

library(twitteR)
library(dplyr)
library(syuzhet)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(data.table)
library(htmlTable)

All of the above libraries are useful for understanding the Twitter data. In particular:

syuzhet: Producing sentiment and emotions
tm: Text mining package that allows for text to be mined, manipulated and transformed
SnowballC: Stemming package that reduces words to a common root, which makes vocabulary easier to compare
wordcloud & RColorBrewer: Used to create the word cloud with multiple colours to indicate frequency

Much of the procedure was obtained from these data science websites: https://towardsdatascience.com/setting-up-twitter-for-text-mining-in-r-bcfc5ba910f4 & http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

twitteR procedure

To download tweets into R, a Twitter account was needed and the twitteR package had to be installed and loaded. Within Twitter, an application then had to be created to grant access to Twitter’s API, which allows searching for tweets that carry a certain hashtag (#). Once the search was complete, the tweets could be stored within R along with information such as the author’s Twitter handle, the date and time of the tweet, and whether or not the tweet had been favourited or retweeted.

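# The four credentials are defined elsewhere and kept private
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
tweetsfull <- twitteR::searchTwitter("#socceroos", n = 500, lang = "en", since = '2018-06-16', resultType = 'recent')
tweetsfull <- twListToDF(tweetsfull) # Convert the list of tweets to a dataframe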

The 4 components required for Twitter API access are generated on behalf of a personal account and thus will not be made public. The 500 most recent tweets using the #Socceroos hashtag since Saturday the 16th of June (the day of the match) were taken and saved as a dataframe so that they could be manipulated and accessed later. Next, the tweet text itself was isolated from the dataframe in order to carry out text mining procedures.
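One common way to keep such credentials out of a shared script is to read them from environment variables. The following is only a sketch under that assumption: the variable names and the helper `get_credentials` are hypothetical and were not part of the original analysis.

```r
# Hypothetical helper: reads the four Twitter credentials from environment
# variables so that they never appear in the script itself.
# The variable names below are assumptions made for illustration.
get_credentials <- function(vars = c("TWITTER_CONSUMER_KEY",
                                     "TWITTER_CONSUMER_SECRET",
                                     "TWITTER_ACCESS_TOKEN",
                                     "TWITTER_ACCESS_SECRET")) {
  values <- Sys.getenv(vars, unset = NA)
  missing <- vars[is.na(values)]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  values
}

# Possible usage (requires the twitteR package and real credentials):
# creds <- get_credentials()
# setup_twitter_oauth(creds[[1]], creds[[2]], creds[[3]], creds[[4]])
```

The helper stops with an error if any variable is unset, which makes a missing credential visible before any API call is attempted.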

tweets <- tweetsfull %>% select(text)
tweets <- dplyr::distinct(tweets) # Assign the result so that duplicates are actually dropped

The tweets themselves were then isolated with duplicate tweets removed. We can now move onto the 3 main objectives of this project:
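As a small illustration of the de-duplication step, the sketch below uses a made-up three-row dataframe (not the real tweets). It shows that `distinct()` returns a new dataframe and leaves its input untouched, so the result has to be assigned to be kept.

```r
library(dplyr)

# Hypothetical toy data standing in for the text column of the downloaded tweets
toy_tweets <- data.frame(
  text = c("Go #Socceroos", "Unlucky tonight", "Go #Socceroos"),
  stringsAsFactors = FALSE
)

# distinct() returns a new dataframe; toy_tweets itself is not modified
unique_tweets <- distinct(toy_tweets)

nrow(toy_tweets)    # rows before de-duplication
nrow(unique_tweets) # rows after de-duplication
```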

Which tweet was most popular?

mostpop <- tweetsfull %>% select(text,retweetCount,screenName,id) %>%
  filter(retweetCount == max(retweetCount))

The most popular tweet was retweeted 134 times and appeared to contain a video of an Australian player being stretchered off a field by two paramedics. One of these men then drops his side of the stretcher, fumbling the player onto the pitch. This was tweeted out before the Australia vs. France match commenced to bring “A bit of humour before the big match”.

## [1] "A bit of humour before the big match . . . #FRAAUS #WorldCupRussia #WorldCup18 #WorldCup #FifaWorldCup2018 #FIFAWorldCup #GoSocceroos #AUS #FRA #Russia2018 #Russia2018WorldCup #LOL #humor #humour #funny #FIFA #Mondiali2018 #football #soccer #Socceroos #Worldcup2018Russia"

As a side note, it appears that users attach many hashtags to spread their videos and gain more followers. At the time of writing, this user had approximately 67,500 followers.
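The filter used to find the most popular tweet keeps every row whose retweet count equals the maximum, so it can return more than one tweet when there is a tie. A minimal sketch on made-up data (the column names mirror those used above, the values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with a tie for the highest retweet count
toy <- data.frame(
  text = c("tweet a", "tweet b", "tweet c"),
  retweetCount = c(5, 12, 12),
  stringsAsFactors = FALSE
)

# Keeps all rows that share the maximum retweet count
top <- toy %>% filter(retweetCount == max(retweetCount))
top
```

Checking `nrow(top)` is a quick way to confirm whether a single tweet or several tied tweets were returned.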

What was the overall sentiment?

Before breaking the tweets down by sentiment and producing a word cloud, the tweets had to be cleaned and transformed to make this process much easier. The dataframe containing our tweets first needed to be converted into a corpus, which is essentially a collection of text documents that the tm package can process.

mycorpus <- Corpus(VectorSource(tweets))
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
# Remove URLs first, while their punctuation is still intact
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
mycorpus <- tm_map(mycorpus, content_transformer(removeURL))
# Replace non-word characters (emojis, punctuation) with a space
mycorpus <- tm_map(mycorpus, content_transformer(gsub), pattern = "\\W", replacement = " ")
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english")) # Removes stop words
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, stemDocument) # Stemming reduces words to a common root
mycorpus <- tm_map(mycorpus, stripWhitespace) # Collapse leftover whitespace last

Two important notes here are stop words and stemming. Stop words in the English language are common ‘filler’ words that carry little emotion or sentiment on their own. Examples include: ‘about’, ‘doing’, ‘especially’ and ‘for’. What’s left after these words have been removed is essentially the ‘true’ message within a tweet.
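To make the effect of stop-word removal concrete, here is a minimal sketch that applies the same tm helpers directly to a single made-up sentence rather than to a corpus. The sentence is an assumption chosen for illustration.

```r
library(tm)

# Hypothetical example sentence, not taken from the collected tweets
toy <- "The Socceroos were unlucky, but they played well!"

cleaned <- tolower(toy)
cleaned <- removePunctuation(cleaned)
cleaned <- removeWords(cleaned, stopwords("english")) # drop common filler words
cleaned <- stripWhitespace(cleaned)                   # collapse the gaps left behind
cleaned <- trimws(cleaned)
cleaned
```

Words such as ‘the’, ‘were’, ‘but’ and ‘they’ are in the English stop-word list and are removed, while content words such as ‘socceroos’ and ‘unlucky’ remain.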

Stemming reduces related words such as ‘computation’ and ‘computing’ to a common stem. We might expect this stem to be ‘compute’, but we are given ‘comput’. Unfortunately, stemming may remove the last letter of a word, and fixing this problem would require going through each stem individually. However, enough of the word remains that the meaning can still be understood.
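The stemming behaviour described here can be checked on a few individual words. This sketch calls the SnowballC stemmer directly and assumes the English stemmer is the one in use; the word list is chosen for illustration.

```r
library(SnowballC)

# Words chosen to mirror the examples discussed in the text
words <- c("computation", "computing", "socceroos", "france")

# Assumption: the English Snowball stemmer
stems <- wordStem(words, language = "english")

# Show each word next to its stem
setNames(stems, words)
```

Both ‘computation’ and ‘computing’ collapse to the same stem, which is what allows their frequencies to be counted together.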

result <- get_nrc_sentiment(as.character(mycorpus))
result <- result[1,]
t_result <- transpose(result)
colnames(t_result) <- ("Tweets")
t_result$Sentiment <- colnames(result)
t_result <- arrange(t_result,desc(Tweets))

Sentiment	Tweets
Positive	83
Negative	69
Trust	45
Anticipation	44
Anger	33
Joy	29
Surprise	28
Fear	27
Sadness	26
Disgust	19

As we can see in this table, positive (83) and negative (69) were the highest-scoring categories for tweets that had the #Socceroos hashtag. Of the 10 categories in total, 4 were positive, 5 were negative and 1 was neutral.

Positive	Negative	Neutral
Positive	Negative	Surprise
Trust	Anger
Anticipation	Fear
Joy	Sadness
	Disgust

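Whether these sentiments were due to the performance of the Socceroos, their opposing team, referee calls or other factors is difficult to discern from this table alone. Interestingly, the counts in the table sum to only 403 rather than 500. This does not necessarily mean that 97 tweets were left out: get_nrc_sentiment counts words that match the NRC lexicon rather than whole tweets, so one word can contribute to several categories while words with no lexicon entry contribute to none.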
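To see what the sentiment function returns, the sketch below scores two made-up sentences. It is only meant to show the shape of the output: one row per input string and one column for each of the eight NRC emotions plus the negative and positive scores, where each cell is a count of matching words.

```r
library(syuzhet)

# Hypothetical example sentences, not taken from the collected tweets
toy <- c("what a great win, so happy", "terrible loss, awful referee")

scores <- get_nrc_sentiment(toy)

dim(scores)      # rows = input strings, columns = categories
colnames(scores) # the emotion and sentiment category names
colSums(scores)  # word counts per category across all inputs
```

Because the cells are word counts, the column totals depend on how many words match the lexicon and need not add up to the number of input strings.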

Word cloud

dtm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

The words ‘socceroo’ and ‘worldcup’ were the most popular (appearing 322 and 178 times respectively), followed by other terms related to the World Cup and its viewing experience. These included ‘footbal’, ‘optus’, ‘watch’ and ‘australia’. As mentioned previously, the stemming process may sometimes remove the last letter of a word, which is why ‘socceroo’ appears and not ‘socceroos’, or ‘franc’ instead of ‘france’. Interestingly enough, the word ‘soccer’ was only hashtagged 9 times, with many people using the international term ‘football’.

From this word cloud we can see that many used the #socceroos hashtag but the majority of tweets expressed personal feelings. In fact, 55% of the words tweeted appeared only once and only 7 words appeared more than 50 times.
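Figures such as the share of words that appear only once can be derived from a word frequency table. The following base R sketch shows the calculation on three made-up, already-cleaned strings; the real analysis would use the frequencies in `d$freq` instead.

```r
# Hypothetical cleaned text standing in for the processed tweets
toy <- c("socceroos worldcup watch",
         "socceroos football",
         "worldcup socceroos optus")

# Split into words and count how often each one occurs
words <- unlist(strsplit(toy, "\\s+"))
freq <- sort(table(words), decreasing = TRUE)

# Share of distinct words that occur exactly once
share_once <- mean(as.vector(freq) == 1)

freq
share_once
```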

Aiton McPhee

18 June 2018
