Note: this project is still in development.

In this project we will try to get data on a particular topic from Twitter. We will use this collection of tweets to estimate their mood (using Natural Language Processing) and also map their locations. We will barely scratch the surface of what is possible with the Twitter API; in future sessions we might also show, for example, who the most important people (influencers) on a given topic are.

Let’s start by loading a few dependencies.

library(twitteR)       # Twitter API client
library(knitr)         # nicely formatted tables with kable()
library(wordcloud)     # word cloud plots
library(tm)            # text mining / corpus handling
library(RColorBrewer)  # colour palettes
library(sentiment)     # sentiment classification
library(stringr)       # string manipulation
library(dismo)         # geocode() for turning place names into coordinates
library(maps)          # base map data
library(dplyr)         # data manipulation
library(leaflet)       # interactive maps
library(ggplot2)       # plotting

After this we will authenticate. We can store our keys as environment variables, so they never appear in the script itself, and then retrieve them with the Sys.getenv() function.
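
For instance, the environment variables could be set in ~/.Rprofile with Sys.setenv(); the values below are placeholders, not real credentials:

Sys.setenv(api_key = "your_consumer_key",
           api_secret = "your_consumer_secret",
           access_token = "your_access_token",
           access_token_secret = "your_access_token_secret")

With that in place, the keys can be pulled back into the session and passed to setup_twitter_oauth():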

api_key <- Sys.getenv("api_key")
api_secret <- Sys.getenv("api_secret")
access_token <- Sys.getenv("access_token")
access_token_secret <- Sys.getenv("access_token_secret")
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"

In the next code chunk we choose our topic of interest and set the number of tweets to fetch.

topic <- "bernie"
number_of_tweets <- 100

And then proceed with extracting the data based on the query.

tweets <- searchTwitter(topic, n = number_of_tweets, lang = "en")
tweetFrame <- twListToDF(tweets)                        # list of status objects -> data frame
tweets_text <- sapply(tweets, function(x) x$getText())  # extract just the tweet text
tweets_corpus <- Corpus(VectorSource(tweets_text))      # build a tm corpus from the text
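
For a quick sanity check of what came back, twListToDF() returns an ordinary data frame that can be inspected as usual (screenName, created and text are among twitteR's standard status fields):

head(tweetFrame[, c("screenName", "created", "text")])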

Then we clean up the corpus a little before visualising it.

# fix the encoding first ("UTF-8-MAC" is macOS-specific; plain "UTF-8" works elsewhere)
tweets_corpus <- tm_map(tweets_corpus,
                        content_transformer(function(x) iconv(x, to = "UTF-8-MAC", sub = "byte")),
                        mc.cores = 1)
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower), mc.cores = 1)
tweets_corpus <- tm_map(tweets_corpus, removePunctuation, mc.cores = 1)
tweets_corpus <- tm_map(tweets_corpus, function(x) removeWords(x, stopwords()), mc.cores = 1)
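
To verify that the cleaning worked, we can peek at a few documents with tm's inspect():

inspect(tweets_corpus[1:3])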

With a clean corpus we can create a word cloud, so that we can see the most commonly occurring words in these tweets.

pal2 <- brewer.pal(8, "Dark2")
wordcloud(tweets_corpus, min.freq = 2, max.words = 100, random.order = TRUE, colors = pal2)
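
The word cloud is only a visual summary; if we also want the underlying counts, a standard tm idiom is to build a term-document matrix and sort the row sums (a minimal sketch):

tdm <- TermDocumentMatrix(tweets_corpus)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # total count per term
head(word_freq, 10)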

It would also be interesting to see where people tweet from. For this we can use the excellent leaflet library through its R port, and the geocode() function from dismo to turn user-supplied location names into geographical coordinates.

searchTerm <- topic
searchResults <- searchTwitter(searchTerm, n = number_of_tweets)
tweetFrame <- twListToDF(searchResults)
userInfo <- lookupUsers(tweetFrame$screenName)      # fetch the authors' profiles
userFrame <- twListToDF(userInfo)
locatedUsers <- !is.na(userFrame$location)          # keep only users with a location field
locations <- geocode(userFrame$location[locatedUsers])
## ZERO_RESULTS : Jerzey 
## ZERO_RESULTS :  
## ZERO_RESULTS : omw+to+graduation 
## [1] "try 2 ..."
## ZERO_RESULTS : Jerzey 
## ZERO_RESULTS :  
## ZERO_RESULTS : omw+to+graduation 
## [1] "try 3 ..."
## ZERO_RESULTS : Jerzey 
## ZERO_RESULTS :  
## [1] "try 4 ..."
## [1] "try 5 ..."
## [1] "try 6 ..."
geo_data <- locations %>%
    select(longitude, latitude) %>%
    na.omit()                                       # drop locations that could not be geocoded

leaflet() %>%
    setView(lat = 43.7025989, lng = -75.5094808, zoom = 3) %>%
    addProviderTiles("Esri.WorldStreetMap") %>%
    addMarkers(lng = geo_data$longitude, lat = geo_data$latitude,
               clusterOptions = markerClusterOptions())
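
Since maps and ggplot2 are already loaded, a static map is a handy fallback for contexts where the interactive leaflet widget cannot be embedded. A minimal sketch:

world <- map_data("world")
ggplot(world, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "grey90", colour = "grey60") +
    geom_point(data = geo_data, aes(x = longitude, y = latitude),
               inherit.aes = FALSE, colour = "red", alpha = 0.5) +
    theme_minimal()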

Now for probably the most difficult part of Twitter analysis: cleaning the raw text. Tweets are a heterogeneous mix of characters and strings (there are even efforts to decode emoji) that can be quite difficult to parse:

# remove retweet entities
tweets_text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets_text)
# remove @-mentions
tweets_text <- gsub("@\\w+", "", tweets_text)
# remove punctuation
tweets_text <- gsub("[[:punct:]]", "", tweets_text)
# remove numbers
tweets_text <- gsub("[[:digit:]]", "", tweets_text)
# remove html links (punctuation is already stripped, so each link is now a single "http..." token)
tweets_text <- gsub("http\\w+", "", tweets_text)
# collapse runs of whitespace to a single space (replacing with "" would glue adjacent words together)
tweets_text <- gsub("[ \t]{2,}", " ", tweets_text)
# trim leading and trailing whitespace
tweets_text <- gsub("^\\s+|\\s+$", "", tweets_text)

# tolower() can fail on invalid multibyte strings, so wrap it in tryCatch()
# and return NA for any tweet that cannot be converted
try.error <- function(x) {
    tryCatch(tolower(x), error = function(e) NA)
}

# lower case using try.error with sapply
tweets_text <- sapply(tweets_text, try.error)

# drop tweets that could not be converted
tweets_text <- tweets_text[!is.na(tweets_text)]
names(tweets_text) <- NULL

Then we can proceed with some sentiment analysis, and show the results in a nicely formatted table.

# sentiment() is vectorised, so we can classify a whole character vector at once;
# no per-tweet loop is needed
subset.tweets.text <- tweets_text[1:100]
subset.tweets.sentiments <- sentiment(subset.tweets.text)$polarity

# note: this assumes no tweets were dropped as NA above, so the screen names
# still line up with the cleaned text
sentiment.df <- data.frame(tweetFrame$screenName, subset.tweets.text, subset.tweets.sentiments)
kable(sentiment.df[1:20, ])
|tweetFrame.screenName |subset.tweets.text |subset.tweets.sentiments |
|:---|:---|:---|
|ADudenhoefer |democrats should give bernieof billarys delegates cuz he earned em n he aint goingprison e |neutral |
|AlanSuderman |spike lee backs bernie sanderswake up south carolina |neutral |
|BrandenKooper |spike lee endorses bernie sanders for president says he will do the right thing in the white house |neutral |
|cunninghammeli1 |hillary gt bs you might wanna check yo ur boy bernies desperate amp corrupt campaign tricks oops ツ |neutral |
|kytja |after nevada the threats from the bernie of bust crowd grow louder |neutral |
|naq2GcHGdV5nDxR |bernie sanders wont win if he hasof the millennial support but onlyshow up to vote |neutral |
|PlatinumChef |what i like about bernie sandersvia |positive |
|the_fire_berns |even if youre not voting for bernie you could admire the fact that hes real as fuck |neutral |
|MoriahReinholz |new blog post fund the power re spike leebernie sanders radio ad |neutral |
|EamesChristian |spike lee backs sanders in radio adap photo |neutral |
|urubugonzales |sos bernie curious but cautious beginner save b noon tues when nyc acc starts killinghttp |neutral |
|_amandacardenas |bybernie adopts republican smearmachine tactics hillary is a powerhungr y opportunist in it for herself |neutral |
|sk45202 |floridians dont need bernie sanders to get affordable college they just need to do work in high school brightfutures |negative |
|cmcneilstein |what i like about bernie sandersvia |positive |
|jojoblank3 |that fade isntminutes shows how little bernie cares about black issues |neutral |
|kaycey55 |lets goooo bernie |neutral |
|lookatMEYAnow |delegate count leaving bernie sanders with steep climb |neutral |
|DcLarson_ |spike lee endorsesin south carolina radio ad bernie takes no money from corporations nada |neutral |
|BUSHADEMOCRAT |bernie sanders releases all zero speeches hes given to wall street |neutral |
|catguardian |spike lee endorses bernie sanders for president says he will do the right thing in the white house |neutral |

And finally let's count the sentiments:

barplot(sort(table(sentiment.df$subset.tweets.sentiments)), col = "lightblue", main = "Twitter Sentiment Analysis", xlab = "Sentiment", ylab = "Count")
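
If you prefer the raw numbers to the plot, the same counts can be printed directly:

table(sentiment.df$subset.tweets.sentiments)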

In the end there are many things left to improve. For example, we should perhaps use a better package for sentiment analysis, since some of the results are obviously misclassified, and there is definitely room for improvement in our data-cleaning regexes. But I hope this demonstrates the power of R and the Twitter API for deriving meaningful insights.