The purpose of this project was to use the ‘twitteR’ library in R to investigate tweets made after the Australia vs. France group stage match in the 2018 World Cup. 500 tweets that used the #Socceroos hashtag were pulled after the conclusion of the game, which ended with Australia losing 2-1. The tweets were then cleaned to remove unnecessary punctuation, emojis, whitespace and words in order to investigate:
It is hoped that after producing results, the emotions and feelings of Socceroos fans can be understood in a way that is simple and does not require sifting through hundreds of tweets.
library(twitteR)
library(dplyr)
library(syuzhet)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(data.table)
library(htmlTable)
All of the above libraries are useful to understand the twitter data:
Much of the procedure was obtained from these data science websites: https://towardsdatascience.com/setting-up-twitter-for-text-mining-in-r-bcfc5ba910f4 & http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
To download tweets into R, the twitteR package first had to be installed and loaded, and a Twitter account was required. Within Twitter, a user then has to create a Twitter application that grants access to Twitter’s API, which makes it possible to search for tweets that carry a certain hashtag (#). Once the search was complete, the tweets could be stored within R along with information such as the author's Twitter handle, the date and time of the tweet, and whether or not the tweet was favourited or retweeted.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
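tweetsfull <- twitteR::searchTwitter("#socceroos", n = 500, lang = "en", since = "2018-06-16", resultType = "recent")
tweetsfull <- twitteR::twListToDF(tweetsfull) # Converts the returned list of tweets into a data frame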
The 4 components required for Twitter API access are generated on behalf of a personal account and thus will not be made public. The tweets were saved as a data frame so that they could be manipulated and accessed later. 500 of the most recent tweets using the #Socceroos hashtag after Saturday the 16th of June (the day of the match) were taken. Next, the tweet text itself was isolated from the data frame in order to carry out text mining procedures.
tweets <- tweetsfull %>% select(text)
tweets <- dplyr::distinct(tweets) # Assign the result, otherwise the duplicates are not actually dropped
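As a minimal sketch of the de-duplication step, the example below uses a small made-up data frame and base R's `duplicated()` as a stand-in for `dplyr::distinct()`. The tweet texts are invented for illustration. The key point is that the filtered result has to be assigned to a variable to take effect.

```r
# Toy data: three tweets, one of which is an exact duplicate (invented examples)
tweets <- data.frame(
  text = c("Go #Socceroos!", "Unlucky tonight #Socceroos", "Go #Socceroos!"),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each tweet text
unique_tweets <- tweets[!duplicated(tweets$text), , drop = FALSE]

nrow(tweets)        # number of rows before de-duplication
nrow(unique_tweets) # number of rows after de-duplication
```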
The tweets themselves were then isolated with duplicate tweets removed. We can now move onto the 3 main objectives of this project:
mostpop <- tweetsfull %>% select(text,retweetCount,screenName,id) %>%
filter(retweetCount == max(retweetCount))
The most popular tweet was retweeted 134 times and appeared to contain a video of an Australian player being stretchered off a field by two paramedics. One of these men then drops his side of the stretcher before fumbling the player off the pitch. This was tweeted out before the Australia vs. France match commenced to bring “A bit of humour before the big match”.
## [1] "A bit of humour before the big match . . . #FRAAUS #WorldCupRussia #WorldCup18 #WorldCup #FifaWorldCup2018 #FIFAWorldCup #GoSocceroos #AUS #FRA #Russia2018 #Russia2018WorldCup #LOL #humor #humour #funny #FIFA #Mondiali2018 #football #soccer #Socceroos #Worldcup2018Russia"
As a side note, it appears that users will use many hashtags to spread their video and obtain more followers. At the time of writing, this user had approximately 67500 followers.
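The selection of the most retweeted tweet can be sketched in base R on a made-up data frame. The column names `text`, `retweetCount` and `screenName` mirror those used in the code above, while the rows and user names are hypothetical. Comparing against `max()` keeps every row that ties for the highest count.

```r
# Toy data with the same column names as the tweet data frame (values are invented)
toy_tweets <- data.frame(
  text = c("first tweet", "second tweet", "third tweet"),
  retweetCount = c(3, 134, 12),
  screenName = c("user_a", "user_b", "user_c"),
  stringsAsFactors = FALSE
)

# Keep the row(s) whose retweet count equals the maximum
most_popular <- toy_tweets[toy_tweets$retweetCount == max(toy_tweets$retweetCount), ]
most_popular
```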
Before breaking down the tweets by sentiment and producing a word cloud, the tweets had to be manipulated and transformed to make this process much easier. The data frame containing our tweets first needed to be converted into a corpus, which is essentially a collection of text documents that can be processed in R.
mycorpus <- Corpus(VectorSource(tweets))
mycorpus <- tm_map(mycorpus,content_transformer(tolower))
mycorpus <- tm_map(mycorpus, content_transformer(gsub), pattern="\\W", replacement=" ") # Replaces non-word characters (emojis, punctuation, symbols) with spaces
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
mycorpus <- tm_map(mycorpus, content_transformer(removeURL)) # Removes url text
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english")) # Removes stop words
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, stemDocument) # Stemming reduces related word forms to a common stem
Two important notes here concern stop words and stemming. Stop words in the English language are those that aren’t related to emotion or sentiment and are unnecessary ‘filler’ words. Examples include: ‘about’, ‘doing’, ‘especially’ and ‘for’. What’s left after these words have been removed contains essentially the ‘true’ message within a tweet.
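The effect of the cleaning steps and stop word removal can be illustrated with a base R sketch on a single invented tweet. The helper `clean_tweet()` and the short stop word list are hypothetical and only stand in for the `tm` transformations above. Note that this sketch strips URLs before replacing non-word characters, because once the punctuation in a URL has been turned into spaces, the URL pattern can only match its leading fragment.

```r
# Hypothetical helper: a base R stand-in for the tm cleaning steps
clean_tweet <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("http[^[:space:]]*", "", x) # remove URLs first
  x <- gsub("\\W", " ", x)              # replace non-word characters with spaces
  x <- gsub("[[:digit:]]+", " ", x)     # remove numbers
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]
  paste(words, collapse = " ")
}

# Invented tweet and a deliberately tiny, illustrative stop word list
sample_tweet <- "Gutted about the result, but proud of the #Socceroos! https://t.co/abc123"
illustrative_stop_words <- c("about", "the", "but", "of", "for")

cleaned <- clean_tweet(sample_tweet, illustrative_stop_words)
cleaned
```

In the real pipeline, `stopwords("english")` from `tm` supplies a much longer list than the five words assumed here.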
Stemming takes related words such as ‘computation’ or ‘computing’ and returns their common stem, which would normally be ‘compute’, but here we are given ‘comput’. Unfortunately, stemming may remove the last letter of a stemmed word, and fixing this problem would require going through each stem individually. However, enough of the word remains that the meaning can still be understood.
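The truncated stems described above can be reproduced directly with `wordStem()` from the SnowballC package, which is already loaded in this project. This is only a small illustration on four hand-picked words, and the explicit `language = "english"` argument is an assumption about the stemmer variant to use.

```r
library(SnowballC)

# Hand-picked words from the discussion above
words <- c("computation", "computing", "socceroos", "france")

# Reduce each word to its stem using the English Snowball stemmer
stems <- wordStem(words, language = "english")
data.frame(word = words, stem = stems)
```

Both ‘computation’ and ‘computing’ collapse to the same stem, which is what allows their frequencies to be counted together later on.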
result <- get_nrc_sentiment(as.character(mycorpus))
result <- result[1,]
t_result <- transpose(result)
colnames(t_result) <- ("Tweets")
t_result$Sentiment <- colnames(result)
t_result <- arrange(t_result,desc(Tweets))
| Sentiment | Tweets |
|---|---|
| Positive | 83 |
| Negative | 69 |
| Trust | 45 |
| Anticipation | 44 |
| Anger | 33 |
| Joy | 29 |
| Surprise | 28 |
| Fear | 27 |
| Sadness | 26 |
| Disgust | 19 |
As we can see in this table, the two largest categories were positive (83) and negative (69) sentiment in tweets that had the #Socceroos hashtag. Of the 10 categories, 4 were positive, 5 were negative and 1 was neutral.
| Positive | Negative | Neutral |
|---|---|---|
| Positive | Negative | Surprise |
| Trust | Anger | |
| Anticipation | Fear | |
| Joy | Sadness | |
| | Disgust | |
Whether these sentiments were due to the performance of the Socceroos, their opposing team, referee calls or other factors is difficult to discern from this table alone. Interestingly, the counts in the table sum to only 403, which would mean that 97 of the 500 tweets were left out, possibly a limitation in the algorithm.
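The reshaping of the sentiment result and the 403 total can be checked with a short base R sketch. The one-row data frame below is rebuilt by hand from the counts reported in the table, so it is an assumption about the shape of `result` rather than actual output from `get_nrc_sentiment()`.

```r
# One-row data frame rebuilt from the counts reported in the sentiment table
result <- data.frame(
  anger = 33, anticipation = 44, disgust = 19, fear = 27, joy = 29,
  sadness = 26, surprise = 28, trust = 45, negative = 69, positive = 83
)

# Turn the wide row into a long table and sort by count, largest first
t_result <- data.frame(
  Sentiment = colnames(result),
  Tweets = unlist(result[1, ], use.names = FALSE),
  stringsAsFactors = FALSE
)
t_result <- t_result[order(-t_result$Tweets), ]
rownames(t_result) <- NULL

# Total across all ten categories
total_matches <- sum(t_result$Tweets)
total_matches
```

Lexicon-based tallies of this kind typically count matching words rather than whole tweets, so the total may not correspond one-to-one with the number of tweets collected.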
dtm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The words ‘socceroo’ and ‘worldcup’ were the most popular (tweeted 322 and 178 times respectively), followed by other terms related to the World Cup and its viewing experience. These included ‘footbal’, ‘optus’, ‘watch’ and ‘australia’. As mentioned previously, the stemming process may sometimes remove the last letter of a word, which is why ‘socceroo’ appears instead of ‘socceroos’ and ‘franc’ instead of ‘france’. Interestingly enough, the word ‘soccer’ was only hashtagged 9 times, with many people using the international term ‘football’.
From this word cloud we can see that many used the #socceroos hashtag but the majority of tweets expressed personal feelings. In fact, 55% of words tweeted were only used by 1 person and only 7 words were used by more than 50 people.
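The frequency table behind the word cloud, and the share of words that occur only once, can be sketched in base R. The three short documents below are invented and already stemmed. The real analysis builds the same kind of table from a `TermDocumentMatrix`.

```r
# Invented, already-stemmed toy documents
docs <- c(
  "socceroo worldcup watch",
  "socceroo franc footbal",
  "socceroo worldcup optus"
)

# Split into words and count how often each one occurs
tokens <- unlist(strsplit(docs, "[[:space:]]+"))
freq <- sort(table(tokens), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = as.integer(freq), stringsAsFactors = FALSE)

# Share of distinct words that appear exactly once
share_once <- mean(d$freq == 1)
d
share_once
```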
Based on tweets using the #socceroos hashtag, we can say that users expressed a wide range of emotions alongside their personal opinions and feelings. The overall sentiment was mixed after Australia lost their first World Cup match against France.