You'll need the code I've put in a package called sentiment on github. This is easiest to install using the devtools package.
install.packages("devtools")
require(devtools)
install_github("sentiment1", "spacedman")
You also need a few prerequisites - install these if you haven't got them:
require(twitteR)
require(RColorBrewer)
require(plyr)
require(stringr)
require(ggplot2)
require(wordcloud)
We'll just use the twitteR package to get the most recent tweets with the #nhs hashtag.
nhsTweets = searchTwitter("#nhs", n = 1500)
For this I'm using a word list created by the Computational Story Lab. Each word has been rated for happiness by workers on Amazon's Mechanical Turk, and the score for each tweet is the average score of all the words in the tweet that appear in the list.
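To make the scoring concrete, here's a rough sketch of the calculation. The real work is done by the fScoreLM function in the sentiment package; I'm assuming here that the labMT word column is called word, with the happiness_average column holding the scores.
scoreText = function(text, words = labMT$word, scores = labMT$happiness_average) {
    # Sketch only: split into words, look them up in labMT, average the scores
    tokens = unlist(strsplit(tolower(text), "[^a-z]+"))
    found = match(tokens, words)
    if (all(is.na(found)))
        return(NA)  # nothing in this tweet appears in the word list
    mean(scores[found], na.rm = TRUE)
}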
I also do a bit of filtering: tweets that don't contain any of the words in the list are dropped, as are duplicates. From the text itself I strip out the words “nhs” and “via” (which appears when a tweet is passed on), retweet markers (anything starting with “rt” and a space), URLs, punctuation, numbers, and control characters.
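The cleaning is done by a filter function in the package; it differs in detail, but the idea is something like this:
cleanTweet = function(text) {
    # Sketch of the cleaning described above, not the package's actual filter()
    x = tolower(text)
    x = gsub("^rt ", "", x)                     # leading retweet marker
    x = gsub("http\\S+", "", x)                 # URLs
    x = gsub("[[:cntrl:]]", " ", x)             # control characters
    x = gsub("[[:punct:]]|[[:digit:]]", "", x)  # punctuation and numbers
    x = gsub("\\b(nhs|via)\\b", "", x)          # the hashtag word and "via"
    str_trim(gsub("\\s+", " ", x))              # tidy up the whitespace
}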
That gives us a number of tweets with a score and a created date.
# Load the labMT word list, build a scoring function from its happiness_average
# column, and attach a score to each tweet.
data(labMT)
LMScoref = fScoreLM(labMT, "happiness_average")
setScoreTweets(nhsTweets, LMScoref)
## Loading required package: stringr
# Pull the scored tweets into a data frame, drop those with no scorable words,
# apply the text filter, remove duplicates and sort by score (saddest first).
nhs = ldply(nhsTweets, function(x) {
    data.frame(day = as.Date(x$created), created = x$created, score = x$score,
        text = x$text)
})
nhs = nhs[!is.na(nhs$score), ]
nhs$filtered = laply(nhs$text, function(x) {
    filter(x)
})
nhs = nhs[!duplicated(nhs$filtered), ]
nhs = nhs[order(nhs$score), ]
Let's see what we've got: first the unhappiest tweets (with scores), and then the happiest.
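Since the data frame is sorted by score, something like this lists both ends:
head(nhs[, c("score", "text")], 10)  # the saddest end
tail(nhs[, c("score", "text")], 10)  # the happiest end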
For comparison, the average happiness score over the whole word list is 5.3752.
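That baseline is just the mean of the happiness_average column:
mean(labMT$happiness_average)  # the 5.3752 figure quoted above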
This leaves us with a total of 1239 tweets. Let's plot them over time and look at a histogram of the score.
ggplot(data = nhs, aes(x = created)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
ggplot(data = nhs, aes(x = score)) + geom_histogram(aes(y = ..density..)) +
geom_density()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
Everybody loves wordclouds right? Okay, maybe not. Let's divide up the tweets into three equal-sized groups ordered by score and do a wordcloud of the two extreme groups.
# Cut points for three equal-sized groups by score, plus colour palettes:
# reds for the unhappy cloud, blues for the happy one.
qs = quantile(nhs$score, probs = seq(0, 1, len = 4), na.rm = TRUE)
nhs$scoreQ = cut(nhs$score, qs)
pal = brewer.pal(11, "Spectral")
p1 = brewer.pal(9, "OrRd")[5:9]
p2 = brewer.pal(9, "PuBu")[5:9]
# Words from the unhappiest third of the tweets...
wordcloud(nhs[as.numeric(nhs$scoreQ) == 1, ]$filtered, max.words = 100, colors = p1,
    random.order = FALSE)
## Loading required package: tm
# ...and from the happiest third.
wordcloud(nhs[as.numeric(nhs$scoreQ) == 3, ]$filtered, max.words = 100, colors = p2,
    random.order = FALSE)
First we have the words found in the unhappiest tweets, and then the words found in the happiest tweets.
First let's get a batch of tweets for each day at the end of February:
dailyTweets = list2df(getDailies("#nhs", 2013, 2, 21:28, 120, LMScoref))
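getDailies and list2df come from the sentiment package; the same sort of thing can be done directly with searchTwitter's since and until arguments, one call per day. A sketch of the idea, not the package's actual code:
getDay = function(term, day, n, scoref) {
    # Sketch: fetch one day's tweets, score them, and label them with the day
    tweets = searchTwitter(term, n = n, since = as.character(day),
        until = as.character(day + 1))
    setScoreTweets(tweets, scoref)
    ldply(tweets, function(x) data.frame(when = day, score = x$score))
}
# e.g. getDay("#nhs", as.Date("2013-02-21"), 120, LMScoref)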
ggplot(dailyTweets, aes(factor(when), score)) + geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
Obviously plenty of other things could be done here: other visualisations, other approaches to sentiment analysis, and so on. I'd like to investigate using the tm package and treating the tweets as a corpus data structure. The statistical measures could also be a bit more robust - currently only one or two words are enough to rate a tweet at an extreme of the scale, but with only 140 characters there's not a lot you can do.
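As a starting point, the filtered text drops straight into a corpus, something like:
library(tm)
tweetCorpus = Corpus(VectorSource(nhs$filtered))
tdm = TermDocumentMatrix(tweetCorpus)  # terms by tweets, ready for tm's tools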
I'd also like to be able to get more historical tweets - currently I only seem to be able to go back a few days with the searchTwitter function.