Note: this project is still in development.

In this project we will try to get data on a particular topic from Twitter. We will use this collection of tweets to estimate their mood (using Natural Language Processing) and also map their locations. We will barely scratch the surface of what is possible with the Twitter API; in future sessions we might also show, for example, who the most important people (influencers) on a given topic are.

Let’s start by loading a few dependencies.

library(twitteR)       # Twitter API client
library(knitr)         # nicely formatted tables with kable()
library(wordcloud)     # word cloud plots
library(tm)            # text mining / corpus handling
library(RColorBrewer)  # colour palettes
library(sentiment)     # sentiment classification
library(stringr)       # string manipulation
library(dismo)         # geocode() for turning place names into coordinates
library(maps)          # base map data
library(dplyr)         # data manipulation
library(leaflet)       # interactive maps
library(ggplot2)       # plotting

After this we will authenticate. We can store our keys as environment variables, so they never appear in the script itself, and then retrieve them with the Sys.getenv() function.
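
For instance, the environment variables could be set in ~/.Rprofile with Sys.setenv(); the values below are placeholders, not real credentials:

Sys.setenv(api_key = "your_consumer_key",
           api_secret = "your_consumer_secret",
           access_token = "your_access_token",
           access_token_secret = "your_access_token_secret")

With that in place, the keys can be pulled back into the session and passed to setup_twitter_oauth():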

api_key <- Sys.getenv("api_key")
api_secret <- Sys.getenv("api_secret")
access_token <- Sys.getenv("access_token")
access_token_secret <- Sys.getenv("access_token_secret")
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"

In the next code chunk we choose our topic of interest and set the number of tweets to fetch.

topic <- "bernie"
number_of_tweets <- 100

And then proceed with extracting the data based on the query.

tweets <- searchTwitter(topic, n = number_of_tweets, lang = "en")
tweetFrame <- twListToDF(tweets)                        # list of status objects -> data frame
tweets_text <- sapply(tweets, function(x) x$getText())  # extract just the tweet text
tweets_corpus <- Corpus(VectorSource(tweets_text))      # build a tm corpus from the text
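
For a quick sanity check of what came back, twListToDF() returns an ordinary data frame that can be inspected as usual (screenName, created and text are among twitteR's standard status fields):

head(tweetFrame[, c("screenName", "created", "text")])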

Then we clean up the corpus a little before visualising it.

# fix the encoding first ("UTF-8-MAC" is macOS-specific; plain "UTF-8" works elsewhere)
tweets_corpus <- tm_map(tweets_corpus,
                        content_transformer(function(x) iconv(x, to = "UTF-8-MAC", sub = "byte")),
                        mc.cores = 1)
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower), mc.cores = 1)
tweets_corpus <- tm_map(tweets_corpus, removePunctuation, mc.cores = 1)
tweets_corpus <- tm_map(tweets_corpus, function(x) removeWords(x, stopwords()), mc.cores = 1)
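
To verify that the cleaning worked, we can peek at a few documents with tm's inspect():

inspect(tweets_corpus[1:3])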

With a clean corpus we can create a word cloud, so that we can see the most commonly occurring words in these tweets.

pal2 <- brewer.pal(8, "Dark2")
wordcloud(tweets_corpus, min.freq = 2, max.words = 100, random.order = TRUE, colors = pal2)
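
The word cloud is only a visual summary; if we also want the underlying counts, a standard tm idiom is to build a term-document matrix and sort the row sums (a minimal sketch):

tdm <- TermDocumentMatrix(tweets_corpus)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # total count per term
head(word_freq, 10)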

It would also be interesting to see where people tweet from. For this we can use the excellent leaflet library through its R port, and the geocode() function from dismo to turn user-supplied location names into geographical coordinates.

searchTerm <- topic
searchResults <- searchTwitter(searchTerm, n = number_of_tweets)
tweetFrame <- twListToDF(searchResults)
userInfo <- lookupUsers(tweetFrame$screenName)      # fetch the authors' profiles
userFrame <- twListToDF(userInfo)
locatedUsers <- !is.na(userFrame$location)          # keep only users with a location field
locations <- geocode(userFrame$location[locatedUsers])
## ZERO_RESULTS : Jerzey 
## ZERO_RESULTS :  
## ZERO_RESULTS : omw+to+graduation 
## [1] "try 2 ..."
## ZERO_RESULTS : Jerzey 
## ZERO_RESULTS :  
## ZERO_RESULTS : omw+to+graduation 
## [1] "try 3 ..."
## ZERO_RESULTS : Jerzey 
## ZERO_RESULTS :  
## [1] "try 4 ..."
## [1] "try 5 ..."
## [1] "try 6 ..."
geo_data <- locations %>%
    select(longitude, latitude) %>%
    na.omit()                                       # drop locations that could not be geocoded

leaflet() %>%
    setView(lat = 43.7025989, lng = -75.5094808, zoom = 3) %>%
    addProviderTiles("Esri.WorldStreetMap") %>%
    addMarkers(lng = geo_data$longitude, lat = geo_data$latitude,
               clusterOptions = markerClusterOptions())
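
Since maps and ggplot2 are already loaded, a static map is a handy fallback for contexts where the interactive leaflet widget cannot be embedded. A minimal sketch:

world <- map_data("world")
ggplot(world, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "grey90", colour = "grey60") +
    geom_point(data = geo_data, aes(x = longitude, y = latitude),
               inherit.aes = FALSE, colour = "red", alpha = 0.5) +
    theme_minimal()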

Now for probably the most difficult part of Twitter analysis: cleaning the raw text. Tweets are a heterogeneous mix of characters and strings (there are even efforts to decode emoji) that can be quite difficult to parse:

# remove retweet entities
tweets_text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets_text)
# remove @-mentions
tweets_text <- gsub("@\\w+", "", tweets_text)
# remove punctuation
tweets_text <- gsub("[[:punct:]]", "", tweets_text)
# remove numbers
tweets_text <- gsub("[[:digit:]]", "", tweets_text)
# remove html links (punctuation is already stripped, so each link is now a single "http..." token)
tweets_text <- gsub("http\\w+", "", tweets_text)
# collapse runs of whitespace to a single space (replacing with "" would glue adjacent words together)
tweets_text <- gsub("[ \t]{2,}", " ", tweets_text)
# trim leading and trailing whitespace
tweets_text <- gsub("^\\s+|\\s+$", "", tweets_text)

# tolower() can fail on invalid multibyte strings, so wrap it in tryCatch()
# and return NA for any tweet that cannot be converted
try.error <- function(x) {
    tryCatch(tolower(x), error = function(e) NA)
}

# lower case using try.error with sapply
tweets_text <- sapply(tweets_text, try.error)

# drop tweets that could not be converted
tweets_text <- tweets_text[!is.na(tweets_text)]
names(tweets_text) <- NULL

Then we can proceed with some sentiment analysis, and show the results in a nicely formatted table.

# sentiment() is vectorised, so we can classify a whole character vector at once;
# no per-tweet loop is needed
subset.tweets.text <- tweets_text[1:100]
subset.tweets.sentiments <- sentiment(subset.tweets.text)$polarity

# note: this assumes no tweets were dropped as NA above, so the screen names
# still line up with the cleaned text
sentiment.df <- data.frame(tweetFrame$screenName, subset.tweets.text, subset.tweets.sentiments)
kable(sentiment.df[1:20, ])
|tweetFrame.screenName |subset.tweets.text |subset.tweets.sentiments |
|:---|:---|:---|
|ADudenhoefer |democrats should give bernieof billarys delegates cuz he earned em n he aint goingprison e |neutral |
|AlanSuderman |spike lee backs bernie sanderswake up south carolina |neutral |
|BrandenKooper |spike lee endorses bernie sanders for president says he will do the right thing in the white house |neutral |
|cunninghammeli1 |hillary gt bs you might wanna check yo ur boy bernies desperate amp corrupt campaign tricks oops ツ |neutral |
|kytja |after nevada the threats from the bernie of bust crowd grow louder |neutral |
|naq2GcHGdV5nDxR |bernie sanders wont win if he hasof the millennial support but onlyshow up to vote |neutral |
|PlatinumChef |what i like about bernie sandersvia |positive |
|the_fire_berns |even if youre not voting for bernie you could admire the fact that hes real as fuck |neutral |
|MoriahReinholz |new blog post fund the power re spike leebernie sanders radio ad |neutral |
|EamesChristian |spike lee backs sanders in radio adap photo |neutral |
|urubugonzales |sos bernie curious but cautious beginner save b noon tues when nyc acc starts killinghttp |neutral |
|_amandacardenas |bybernie adopts republican smearmachine tactics hillary is a powerhungr y opportunist in it for herself |neutral |
|sk45202 |floridians dont need bernie sanders to get affordable college they just need to do work in high school brightfutures |negative |
|cmcneilstein |what i like about bernie sandersvia |positive |
|jojoblank3 |that fade isntminutes shows how little bernie cares about black issues |neutral |
|kaycey55 |lets goooo bernie |neutral |
|lookatMEYAnow |delegate count leaving bernie sanders with steep climb |neutral |
|DcLarson_ |spike lee endorsesin south carolina radio ad bernie takes no money from corporations nada |neutral |
|BUSHADEMOCRAT |bernie sanders releases all zero speeches hes given to wall street |neutral |
|catguardian |spike lee endorses bernie sanders for president says he will do the right thing in the white house |neutral |

And finally let's count the sentiments:

barplot(sort(table(sentiment.df$subset.tweets.sentiments)), col = "lightblue", main = "Twitter Sentiment Analysis", xlab = "Sentiment", ylab = "Count")
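
If you prefer the raw numbers to the plot, the same counts can be printed directly:

table(sentiment.df$subset.tweets.sentiments)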

In the end there are many things left to improve. For example, we should perhaps use a better package for sentiment analysis, since some of the results are obviously misclassified, and there is definitely room for improvement in our data-cleaning regexes. But I hope this demonstrates the power of R and the Twitter API for deriving meaningful insights.