Reading so many interesting Christmas-themed posts on Twitter, I thought it would be a nice idea to write something myself. Probably one of the most famous songs for this time of the year is Mariah Carey’s “All I Want for Christmas Is You”, so I decided to find out what Twitter users actually want for Christmas. First of all, let’s load the necessary libraries.
#loading the libraries
library(twitteR)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:twitteR':
##
## id, location
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(purrr)
##
## Attaching package: 'purrr'
## The following objects are masked from 'package:dplyr':
##
## contains, order_by
library(tokenizers)
library(tidytext)
library(tibble)
library(ggplot2)
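Let’s scrape 10,000 tweets (excluding retweets) by searching for the string “All I want for Christmas is”. The credentials (consumerKey, consumerSecret, accessToken and accessSecret) are assumed to be already defined in the session.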
options(httr_oauth_cache = TRUE)
setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret,
                    access_token = accessToken, access_secret = accessSecret)
## [1] "Using direct authentication"
## Adding .httr-oauth to .gitignore
tweets<-searchTwitter('all I want for Christmas is exclude:retweets', n=10000)
Let’s reshape the output of the scraping in a suitable format.
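# as_tibble() replaces tbl_df(), which is deprecated in current versions of dplyr
tweets_df <- as_tibble(map_df(tweets, as.data.frame))
# lower-case the text so that matching the phrase is case-insensitive
raw_texts <- tweets_df$text %>% str_to_lower()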
Since we are interested only in the exact phrase “All I want for Christmas is”, we need to filter out the results not containing this string.
texts<-raw_texts[grep("all i want for christmas is",raw_texts)]
length(texts)
## [1] 8788
Since this is a basic analysis, the focus is just on the first word after “is”, excluding stop words. Besides the default stop words of the tokenizers package, it is necessary to add a few others to get meaningful insights. We collect the part of each tweet right after our base string.
text_split<-texts %>% str_split("all i want for christmas is")
my_stopwords<-c(stopwords(),"some","your","https","my","one","two","t.co")
words_after<-lapply(text_split,'[[',2) %>% tokenize_words(stopwords=my_stopwords)
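To see what the filtering and splitting steps do on a small scale, here is a minimal sketch on made-up tweets. The vector `toy` is invented for illustration, and the sketch uses base R (`grepl()` and `sub()`) rather than the stringr calls above; the two approaches agree when the phrase occurs once in a tweet.

```r
# Toy tweets, invented for illustration only
toy <- tolower(c("All I want for Christmas is a new phone",
                 "all i want for christmas is YOU",
                 "Merry Christmas everyone"))
phrase <- "all i want for christmas is"

# Keep only the tweets containing the exact phrase (fixed = TRUE: literal match)
kept <- toy[grepl(phrase, toy, fixed = TRUE)]

# Drop everything up to and including the first occurrence of the phrase
after <- trimws(sub(paste0("^.*?", phrase), "", kept, perl = TRUE))
print(after)
```

The third toy tweet is discarded by the filter, and only the text following the phrase survives for the other two.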
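Let’s take the first word after the phrase. Tweets with nothing left once the stop words are removed raise a “subscript out of bounds” error, which try() captures as a string, so we drop those entries.
words<-unlist(lapply(words_after,function(x) try(x[[1]],TRUE)))
# grepl() rather than -grep(): a negative index on an empty match would drop every word
words<-words[!grepl("subscript out of bounds",words)]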
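An alternative that avoids catching errors altogether is to check the length of each token vector before indexing it. A minimal sketch, where `toy_tokens` is invented for illustration and stands in for `words_after`:

```r
# Toy token lists, invented for illustration; the second one is empty
toy_tokens <- list(c("new", "phone"), character(0), c("you"))

# First token of each element, or NA when there are no tokens at all
first_word <- vapply(
  toy_tokens,
  function(x) if (length(x) > 0) x[[1]] else NA_character_,
  character(1)
)

# Drop the entries that had no tokens
first_word <- first_word[!is.na(first_word)]
print(first_word)
```

Using `vapply()` with `character(1)` also guarantees that the result is a plain character vector, so no `unlist()` is needed.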
We are left with almost 9000 words. How many of them are related to the song, i.e. have “you” right after “is”? People often use the shorthand “u” for “you”, and sometimes string together many “u”s to echo the tune, so we have to normalize these variants first.
words<-gsub("([u])\\1+","\\1",words)
words<-gsub("\\<u\\>","you",words)
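The effect of these two substitutions is easiest to see on a handful of made-up words (the vector `toy_words` is invented for illustration). The sketch also shows a side effect worth keeping in mind: the first pattern collapses a doubled “u” inside any word, not only in variants of “you”.

```r
# Toy words, invented for illustration only
toy_words <- c("youuuu", "u", "uuu", "you", "uber", "vacuum")

# Step 1: collapse any run of two or more "u" into a single "u"
step1 <- gsub("([u])\\1+", "\\1", toy_words)

# Step 2: replace a stand-alone "u" (a whole word) with "you"
step2 <- gsub("\\<u\\>", "you", step1)
print(step2)
```

Here “uber” is left alone because its “u” is not a whole word, while “vacuum” loses one of its letters in step 1. For a first-word frequency count this is a minor distortion, but it is a trade-off of the simple pattern.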
Let’s get the frequency of our words.
freq<-sort(table(words),decreasing = TRUE)
head(freq)
## words
##     you     new  follow    food someone   money
##    2912     148     103      65      58      54
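Raw counts can also be turned into shares with `prop.table()`, which makes the dominance of a single word easier to judge. A minimal sketch on made-up words (the vector `toy_words` is invented and is not the Twitter data):

```r
# Toy words, invented for illustration only
toy_words <- c("you", "you", "you", "new", "food", "you", "new")

# Frequency table, most common word first
freq_toy <- sort(table(toy_words), decreasing = TRUE)

# Share of each word among all words
share <- prop.table(freq_toy)
print(round(share, 2))
```

Applied to the real table, the same call would be `prop.table(freq)`.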
As expected, the most common word is “you”.
And finally, a plot of the top 10, excluding “you” of course.
x11()
ggplot(data.frame(sort(freq[2:11],decreasing = FALSE)), aes(x=words, y=Freq)) +
geom_bar(stat='identity',color="black",fill="red") +
coord_flip()+
ggtitle('All I want for Christmas is...(besides "you")')+
theme(plot.title = element_text(hjust = 0.5))