All I want for Christmas is…-a text analysis

After reading so many interesting Christmas-themed posts on Twitter, I thought it would be a nice idea to write something myself. Probably one of the most famous songs for this time of the year is Mariah Carey’s “All I want for Christmas is you”, so I decided to find out what Twitter users want for Christmas. First of all, let’s load all the necessary libraries.

#loading the libraries
library(twitteR)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:twitteR':
## 
##     id, location

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(purrr)

## 
## Attaching package: 'purrr'

## The following objects are masked from 'package:dplyr':
## 
##     contains, order_by

library(tokenizers)
library(tidytext)
library(tibble)
library(ggplot2)

Let’s scrape 10000 tweets (excluding retweets) by searching for the string “All I want for Christmas is”.

options(httr_oauth_cache = TRUE)
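# consumerKey, consumerSecret, accessToken and accessSecret hold your own Twitter API credentials
setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret,
                    access_token = accessToken, access_secret = accessSecret)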

## [1] "Using direct authentication"

## Adding .httr-oauth to .gitignore

tweets<-searchTwitter('all I want for Christmas is exclude:retweets',  n=10000)

Let’s reshape the output of the scraping into a suitable format: a data frame with one row per tweet, from which we take the lower-cased text.

# as_tibble() replaces tbl_df(), which is deprecated in recent dplyr versions
tweets_df <- as_tibble(map_df(tweets, as.data.frame))
raw_texts<-tweets_df$text %>% str_to_lower()

Since we are interested only in the exact sentence “All I want for Christmas is”, we need to filter out the results not containing this string.

texts<-raw_texts[grep("all i want for christmas is",raw_texts)]
length(texts)

## [1] 8788
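
The filter is needed because the search returned tweets that do not contain the exact phrase, presumably because it matches on the individual words (that is my assumption, not something checked here). A minimal sketch of the same filter on three made-up tweets, using base R's `grepl()` with `fixed = TRUE` so that the phrase is treated as a literal string:

```r
# Made-up tweets for illustration only (not real data).
raw_texts <- c("all i want for christmas is you",
               "all i really want for christmas is snow",
               "All I Want For Christmas Is food")

# Lower-case first, then keep only the tweets containing the exact phrase.
lowered <- tolower(raw_texts)
keep <- grepl("all i want for christmas is", lowered, fixed = TRUE)
texts <- lowered[keep]
length(texts)
```

The second tweet is dropped because “really” breaks the exact phrase, while the third survives thanks to the lower-casing.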

Since this is a basic analysis, the focus is just on the first word after “is”, excluding stop words. In this case, besides the default stop words of the tokenizers package, it is necessary to add a few others to get meaningful insights. We collect the part of each tweet right after our base string.

text_split<-texts %>% str_split("all i want for christmas is")
my_stopwords<-c(stopwords(),"some","your","https","my","one","two","t.co")
words_after<-lapply(text_split,'[[',2) %>% tokenize_words(stopwords=my_stopwords)
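
To make the splitting step concrete, here is a minimal sketch on three made-up tweets. The sample texts and the tiny stop-word list are assumptions for illustration, not the data or the full stop-word list used above.

```r
library(stringr)
library(tokenizers)

# Made-up, already lower-cased tweets (not real data).
sample_texts <- c("all i want for christmas is you",
                  "omg all i want for christmas is a new phone",
                  "all i want for christmas is some food please")

# Splitting on the base string yields two pieces per tweet here; the second
# piece is whatever follows the phrase.
pieces <- str_split(sample_texts, "all i want for christmas is")
after <- vapply(pieces, function(p) p[[2]], character(1))

# Tokenize what follows, dropping a small illustrative set of stop words.
tokens <- tokenize_words(after, stopwords = c("a", "some"))
first_words <- vapply(tokens, function(x) x[[1]], character(1))
first_words
```

Without the extra stop words, the second and third tweets would contribute “a” and “some”, which is why the default list needs extending.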

Let’s take the first word after the base string. Tweets with no tokens left raise a “subscript out of bounds” error, so we catch those errors and drop them.

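# Take the first token of each tweet; try() returns the error message
# instead of stopping when a tweet has no tokens left
words<-unlist(lapply(words_after,function(x) try(x[[1]],TRUE)))
# Drop the error messages; unlike words[-grep(...)], this logical index
# does not empty the vector when no error occurred
words<-words[!grepl("subscript out of bounds",words)]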
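
As an alternative to matching on the error message, the empty token vectors can be dropped before indexing. This is only a sketch with a toy stand-in for `words_after`; `has_token` and `first_words` are names introduced here for illustration.

```r
# Toy stand-in for `words_after`: a list of token vectors, some of them empty.
words_after <- list(c("you"), character(0), c("new", "phone"), character(0))

# Keep only the tweets that still have at least one token...
has_token <- lengths(words_after) > 0

# ...and take the first token of each; vapply() guarantees a character vector.
first_words <- vapply(words_after[has_token], function(x) x[[1]], character(1))
first_words
```

This avoids `try()` altogether, so a genuine word that happens to contain the text of the error message could never be removed by mistake.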

We are left with almost 9000 words. How many of those are related to the song, i.e. have “you” right after “is”? Bearing in mind that people often use the shorthand “u” for “you”, and other times string together an excessive number of “u”s to mimic the tune, we have to normalize these variants.

words<-gsub("([u])\\1+","\\1",words)
words<-gsub("\\<u\\>","you",words)
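
A quick check of these two substitutions on a few made-up words (the sample vector is an assumption for illustration):

```r
# Hypothetical first words, as they might appear in tweets.
sample_words <- c("you", "youuuu", "u", "uuuu", "new")

# Step 1: collapse any run of repeated "u" into a single "u".
collapsed <- gsub("([u])\\1+", "\\1", sample_words)

# Step 2: replace a standalone "u" (a whole word) with "you".
normalized <- gsub("\\<u\\>", "you", collapsed)
normalized
```

Note that the first substitution is applied to every word, so a legitimate double “u” is collapsed as well (“vacuum” would become “vacum”). For this analysis that side effect seems harmless, since only the count of “you” depends on it.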

Let’s get the frequency of our words.

freq<-sort(table(words),decreasing = TRUE)
head(freq)

## words
##     you     new  follow    food someone   money 
##    2912     148     103      65      58      54

As expected, the most common word is “you”, with 2912 occurrences.
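
To put such a count in context, one could also look at shares rather than raw counts. A minimal sketch on made-up words (not the real data), using `prop.table()` on the frequency table:

```r
# Hypothetical first words; not the real data.
sample_words <- c("you", "you", "you", "new", "food", "you", "new", "money")

# Frequency table sorted from most to least common, as in the analysis.
freq <- sort(table(sample_words), decreasing = TRUE)

# Convert counts to percentages of all words.
share <- round(100 * prop.table(freq), 1)
share[["you"]]
```

In this toy sample “you” accounts for half of the words; the same two lines applied to the real `freq` would give the actual share.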

And finally, a plot of the top 10, excluding of course “you”.

x11()  # opens a new graphics window; not needed when knitting the document
ggplot(data.frame(sort(freq[2:11],decreasing = FALSE)), aes(x=words, y=Freq)) +
  geom_bar(stat='identity',color="black",fill="red") +
  coord_flip()+
  ggtitle('All I want for Christmas is...(besides "you")')+
  theme(plot.title = element_text(hjust = 0.5))

Nicolò Giso

December 23, 2017