1. Installing and loading the libraries

To download data from Twitter you need API keys, which you can find in the application manager of your Twitter developer account. To authorize R to use your API for mining, you have to install several packages: some handle the authorization and some handle the data cleaning. The packages only need to be installed once.

install.packages("tm") ; install.packages("httpuv")
install.packages('base64enc') ;install.packages("twitteR")
install.packages("wordcloud")
install.packages("RColorBrewer") ; install.packages("tm") 
install.packages("stringr")


We load the packages.

library(wordcloud) ; library(RColorBrewer) ; library(tm) ; library(stringr)
library(twitteR) ; library(httr) ; library(devtools) ; library(base64enc)



2. Connecting to Twitter and downloading tweets

Using the twitteR package, we connect to Twitter and authorize our session.

consumer_key <- "Your consumer key"
consumer_secret <- "Your consumer secret"
access_token <- "Your access token" #if no access token is available, set to NULL
access_secret <- "Your access secret" #the same rule applies as for the access token

setup_twitter_oauth(consumer_key = consumer_key,
                    consumer_secret = consumer_secret,
                    access_token = access_token,
                    access_secret = access_secret)
## [1] "Using direct authentication"


We download some data for the #datascience hashtag. The language is set to English and the number of tweets to 2000.

dtsc <- searchTwitter("#datascience",n=2000,
                      lang = "en")



3. Cleaning the data and creating the frequencies data frame

We extract just the text from the tweets and clean it: base R's gsub() removes the links, and then str_replace_all() from the stringr package removes everything that is not a letter or whitespace.

dtsc_text <- sapply(dtsc, function(x) x$getText())
#remove full links (http, https, ftp)
dtsc_text <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", dtsc_text)
#remove "https" fragments left behind by links truncated at the end of tweets
dtsc_text <- gsub("https", "", dtsc_text)
#replace everything that is not a letter or whitespace with a space
dtsc_text <- str_replace_all(dtsc_text, "[^a-zA-Z\\s]", " ")
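
A quick look at the first few cleaned tweets confirms the regular expressions behaved as intended (purely for inspection):

head(dtsc_text, 3) #should contain only letters and whitespace, no links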


Using the tm package, we transform the text into a corpus (a collection of documents), convert it to plain text documents, and build a term-document matrix, treating the search term "datascience" as a stopword alongside the standard English ones.

dtsc_corpus <- Corpus(VectorSource(dtsc_text))
dtsc_corpus <- tm_map(dtsc_corpus,PlainTextDocument)
tdm <- TermDocumentMatrix(
        dtsc_corpus,
        control = list(
                stopwords = c("datascience",
                              stopwords("english")))
)
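
Before building the frequency table, we can peek at the matrix with tm's findFreqTerms(); the threshold of 50 below is an arbitrary choice for illustration:

findFreqTerms(tdm, lowfreq = 50) #terms that appear at least 50 times across all tweets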


We count the word frequencies, sort them, and create a data frame from them.

m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
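
Inspecting the head of the data frame shows the most frequent terms before we plot them:

head(dm, 10) #the ten most frequent words and their counts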



4. Visualizing our data

Finally, using the wordcloud package, we visualize our results: the most used words in tweets with the #datascience hashtag.

wordcloud(dm$word, dm$freq, random.order = FALSE, scale = c(4, .5),
          colors = brewer.pal(8, "Dark2"))
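
If you want to keep the plot, one option is to wrap the call in a base R graphics device; the file name and dimensions below are just examples:

png("datascience_wordcloud.png", width = 800, height = 800)
wordcloud(dm$word, dm$freq, random.order = FALSE, scale = c(4, .5),
          colors = brewer.pal(8, "Dark2"))
dev.off()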



It is evident that machine learning and big data are the most used words. Machine learning is the most widely used technique in data science, and big data is one of its biggest challenges. Frequently mentioned alongside them are words like analytics, deep learning, IoT (Internet of Things) and, ironically for an R tutorial, Python. You may also wonder what kirkdborne means: Kirk D. Borne is a data scientist, blogger, publisher and public speaker, and his Twitter handle is mentioned often enough to make it into the cloud.