You'll need the code I've put in a package called sentiment on github. This is easiest to install using the devtools package.
install.packages("devtools")
require(devtools)
install_github("sentiment1", "spacedman")
You also need a few prerequisites - install these if you haven't got them:
require(twitteR)
require(RColorBrewer)
require(plyr)
require(stringr)
require(ggplot2)
require(wordcloud)
We'll just use the twitteR package to get the most recent tweets with the #nhs hashtag.
nhsTweets = searchTwitter("#nhs", n = 1500)
For this I'm using a word list created by the Computational Story Lab. Each word has been rated for happiness by workers on Amazon's Mechanical Turk, and the score for each tweet is the average score of all the words in the tweet that appear in the list.
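To make the scoring concrete, here's a rough sketch of the calculation. The real work is done by the fScoreLM function in the sentiment package; I'm assuming here that the labMT word column is called word, with the happiness_average column holding the scores.
scoreText = function(text, words = labMT$word, scores = labMT$happiness_average) {
    # Sketch only: split into words, look them up in labMT, average the scores
    tokens = unlist(strsplit(tolower(text), "[^a-z]+"))
    found = match(tokens, words)
    if (all(is.na(found)))
        return(NA)  # nothing in this tweet appears in the word list
    mean(scores[found], na.rm = TRUE)
}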
I also do a bit of filtering: tweets that don't contain any of the words in the list are dropped, as are duplicates. From the text itself I strip out the words “nhs” and “via” (which appears when a tweet is passed on), retweet markers (anything starting with “rt” and a space), URLs, punctuation, numbers, and control characters.
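The cleaning is done by a filter function in the package; it differs in detail, but the idea is something like this:
cleanTweet = function(text) {
    # Sketch of the cleaning described above, not the package's actual filter()
    x = tolower(text)
    x = gsub("^rt ", "", x)                     # leading retweet marker
    x = gsub("http\\S+", "", x)                 # URLs
    x = gsub("[[:cntrl:]]", " ", x)             # control characters
    x = gsub("[[:punct:]]|[[:digit:]]", "", x)  # punctuation and numbers
    x = gsub("\\b(nhs|via)\\b", "", x)          # the hashtag word and "via"
    str_trim(gsub("\\s+", " ", x))              # tidy up the whitespace
}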
That gives us a number of tweets with a score and a created date.
# Load the labMT word list, build a scoring function from its happiness_average
# column, and attach a score to each tweet.
data(labMT)
LMScoref = fScoreLM(labMT, "happiness_average")
setScoreTweets(nhsTweets, LMScoref)
## Loading required package: stringr
# Pull the scored tweets into a data frame, drop those with no scorable words,
# apply the text filter, remove duplicates and sort by score (saddest first).
nhs = ldply(nhsTweets, function(x) {
    data.frame(day = as.Date(x$created), created = x$created, score = x$score,
        text = x$text)
})
nhs = nhs[!is.na(nhs$score), ]
nhs$filtered = laply(nhs$text, function(x) {
    filter(x)
})
nhs = nhs[!duplicated(nhs$filtered), ]
nhs = nhs[order(nhs$score), ]
Let's see what we've got: first the unhappiest tweets (with scores), and then the happiest.
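Since the data frame is sorted by score, something like this lists both ends:
head(nhs[, c("score", "text")], 10)  # the saddest end
tail(nhs[, c("score", "text")], 10)  # the happiest end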
For comparison, the average happiness score over the whole word list is 5.3752.
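That baseline is just the mean of the happiness_average column:
mean(labMT$happiness_average)  # the 5.3752 figure quoted above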
This leaves us with a total of 1239 tweets. Let's plot them over time and look at a histogram of the score.
ggplot(data = nhs, aes(x = created)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
ggplot(data = nhs, aes(x = score)) + geom_histogram(aes(y = ..density..)) +
geom_density()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
Everybody loves wordclouds right? Okay, maybe not. Let's divide up the tweets into three equal-sized groups ordered by score and do a wordcloud of the two extreme groups.
# Cut points for three equal-sized groups by score, plus colour palettes:
# reds for the unhappy cloud, blues for the happy one.
qs = quantile(nhs$score, probs = seq(0, 1, len = 4), na.rm = TRUE)
nhs$scoreQ = cut(nhs$score, qs)
pal = brewer.pal(11, "Spectral")
p1 = brewer.pal(9, "OrRd")[5:9]
p2 = brewer.pal(9, "PuBu")[5:9]
# Words from the unhappiest third of the tweets...
wordcloud(nhs[as.numeric(nhs$scoreQ) == 1, ]$filtered, max.words = 100, colors = p1,
    random.order = FALSE)
## Loading required package: tm
# ...and from the happiest third.
wordcloud(nhs[as.numeric(nhs$scoreQ) == 3, ]$filtered, max.words = 100, colors = p2,
    random.order = FALSE)
First we have the words found in the unhappiest tweets, and then the words found in the happiest tweets.
First let's get a batch of tweets for each day at the end of February:
dailyTweets = list2df(getDailies("#nhs", 2013, 2, 21:28, 120, LMScoref))
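getDailies and list2df come from the sentiment package; the same sort of thing can be done directly with searchTwitter's since and until arguments, one call per day. A sketch of the idea, not the package's actual code:
getDay = function(term, day, n, scoref) {
    # Sketch: fetch one day's tweets, score them, and label them with the day
    tweets = searchTwitter(term, n = n, since = as.character(day),
        until = as.character(day + 1))
    setScoreTweets(tweets, scoref)
    ldply(tweets, function(x) data.frame(when = day, score = x$score))
}
# e.g. getDay("#nhs", as.Date("2013-02-21"), 120, LMScoref)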
ggplot(dailyTweets, aes(factor(when), score)) + geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
Obviously plenty of other things could be done here: other visualisations, other approaches to sentiment analysis, and so on. I'd like to investigate using the tm package and treating the tweets as a corpus data structure. The statistical measures could also be a bit more robust - currently only one or two words are enough to rate a tweet at an extreme of the scale, but with only 140 characters there's not a lot you can do.
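As a starting point, the filtered text drops straight into a corpus, something like:
library(tm)
tweetCorpus = Corpus(VectorSource(nhs$filtered))
tdm = TermDocumentMatrix(tweetCorpus)  # terms by tweets, ready for tm's tools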
I'd also like to be able to get more historical tweets - currently I only seem to be able to go back a few days with the searchTwitter function.