I recently posted a map of New York State showing the exemption rates for measles in each school district. That prompted a few conversations with colleagues about the anti-vaxxer movement, which has posed a significant challenge to public health ever since that infamous paper erroneously linking the MMR vaccine to autism was published in the Lancet. Anti-vaxxers often congregate in the comment sections of blogs and online forums, where they discuss their fears and issues with vaccines openly. Reading this material is a good way to understand the motivations of anti-vaxxers. Naturally, looking through the write-ups and hundreds of comments posted on these pages is not a very fun task, especially if you are a public health practitioner who feels strongly about vaccines. We can delegate this task to R by using text and sentiment analysis to understand what anti-vaxxers are most frequently talking about, in order to develop communication strategies that effectively allay their fears.
The wordcloud.
In this document, I will read in the raw text from the anti-vaxxer blogs, clean it up using the tm package, identify the most frequently used words and visualize them with a wordcloud, and finally carry out a sentiment analysis with the syuzhet package.
I obtained the material for the analysis from 3 anti-vaxxer websites. You can check them out here, here, and here. I simply copied and pasted the entire text from these websites into Notepad because I haven’t yet had the opportunity to learn the basics of web scraping, and also because the bulk of the content came from the blog posts themselves. I made some minor edits to the comments in Notepad, removing usernames manually. Next time around, I hope to be more efficient.
If you would like to reproduce the analysis, I’ve embedded the material within this post. If you are reading this on a PDF and cannot download the material, you can access the HTML version of the post here, and access the file from the link provided there.
xfun::embed_file('antivax.docx')
Download antivax.docx
Download the material, and open it with a text editor to see the 48 pages of raw text that I analyzed.
The tm, wordcloud and syuzhet packages are really cool to work with, as you will see shortly.
library("tm")
library("ggplot2")
library("wordcloud")
library("wesanderson")
library("syuzhet")
As mentioned earlier, I stored all of the text, complete with hyperlinks, punctuation and numbers, as a .txt file using Notepad. Let’s read in the data.
#read in the raw text; each line of the file becomes one element of a character vector
text <- readLines("antivaxraw.txt", encoding = "latin1")
We then transform the text file into a corpus.
corpus <- Corpus(VectorSource(text))
This corpus is a list of 416 objects, each of which has a slot called ‘content’. In other words, each object in the corpus can be thought of as representing a line of text. These can be accessed in the following way.
corpus$content[1] #access the first line of text
## [1] "Given the number of vaccines children receive today, many parents are naturally concerned whether their child may be getting too many, or too many at one time. But instead of taking their legitimate concern seriously, public health officials and the mainstream corporate media just insult their intelligence and straight up lie to them in order to manipulate them into compliance."
Notice that the corpus contains punctuation, numbers, a mix of upper and lower case, special characters and so on. We will clean these up using the following commands, which are pretty self-explanatory.
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
At this stage, the corpus still contains words like ‘this’, ‘that’, ‘won’t’, etc. These are called stopwords. You can view the list of English stopwords recognized by the tm package by calling the stopwords function.
#what are the english stopwords?
stopwords("english")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "one"
## [176] ""
These need to be removed from our corpus.
cleanset <- tm_map(corpus, removeWords, stopwords("english"))
The first time I ran the analysis, I skipped this step. I noticed that my wordcloud had a lot of words like ‘can’ and ‘will’ that weren’t very meaningful. So, I came back here and removed them from my corpus.
cleanset <- tm_map(cleanset, removeWords, c("also", "even", "just", "say", "get", "can", "will", "thank"))
Also, when I looked at the wordcloud, I noticed that words like vaccines, vaccine, vaccination, and vaccinations were being represented as separate entities. I used a series of gsub functions wrapped within a content_transformer in the tm_map call to lump these terms together. I tried a regular gsub without the wrapper but I was getting errors.
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("vaccines", "vaccine", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("diseases", "disease", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("vaccinations", "vaccine", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("vaccination", "vaccine", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("vaccinated", "vaccine", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("children", "child", x)))
cleanset <- tm_map(cleanset, content_transformer(function(x) gsub("autistic", "autism", x)))
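As an aside, the repeated calls above could be collapsed. Here is an optional sketch (not what I ran for this analysis) that consolidates just the vaccine-related variants with a single regular expression:

#optional sketch: merge the vaccine-related variants in one pass
#(longer alternatives come first so that 'vaccines' does not leave a stray 's')
cleanset <- tm_map(cleanset, content_transformer(function(x) {
  gsub("vaccin(ations?|ated|es|e)", "vaccine", x)
}))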
Finally, I noticed that the most common word was ‘vaccine’. I did not think it was useful to include this in the wordcloud. After all, we are looking at antivaxxer blogs and the word vaccine is sure to come up multiple times, while attitudes towards vaccination are more important to us. So, I removed the word.
cleanset <- tm_map(cleanset, removeWords, c("vaccine"))
Now, let us get rid of the extra white space left behind wherever a stopword was removed.
cleanset <- tm_map(cleanset, stripWhitespace)
After all that cleaning, let’s see what our raw data now looks like:
inspect(cleanset[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] given number child receive today many parents naturally concerned whether child may getting many many time instead taking legitimate concern seriously public health officials mainstream corporate media insult intelligence straight lie order manipulate compliance
Only the key words remain. Getting the text into this format greatly aids analysis. Now let us convert this list of 416 sentences into a term document matrix.
As the reply here explains, a term document matrix contains every unique word as a row and has a column for each ‘sentence’ in the corpus. Each entry specifies the number of times a word appears in that sentence. For example, look at the fourth sentence.
corpus$content[4]
## [1] "in the cdc recommended doses of vaccines by age six by the childhood schedule had expanded to include doses of vaccines by age six since the cdc has recommended doses of vaccines by age six this has naturally led many parents to wonder whether getting too many vaccines or too many at once might be potentially harmful to their child"
Take the 7th word in this sentence, ‘vaccines’. The row for the unique word ‘vaccines’ would contain the number 4 under the column for sentence 4, because ‘vaccines’ appears 4 times in that sentence.
dtm <- TermDocumentMatrix(cleanset, control = list(wordLengths = c(1, Inf))) #keep words of any length
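To make the rows-and-columns idea concrete, here is an optional check (the term ‘child’ is just an example of a word that survives the cleaning above):

#optional check: rows are terms, columns are 'sentences', each cell is a count
m_check <- as.matrix(dtm)
m_check["child", 1:5] #how many times 'child' appears in the first five sentences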
We are ready to carry out the analysis. Let us find the most frequently occurring terms. We can use the findFreqTerms function for this.
findFreqTerms(dtm, lowfreq = 5)
## [1] "child" "concern" "concerned"
## [4] "corporate" "getting" "given"
## [7] "health" "instead" "lie"
## [10] "mainstream" "many" "may"
## [13] "media" "number" "order"
## [16] "parents" "public" "receive"
## [19] "taking" "time" "today"
## [22] "whether" "among" "lies"
## [25] "receiving" "safety" "told"
## [28] "cdc’s" "childhood" "schedule"
## [31] "age" "cdc" "doses"
## [34] "might" "potentially" "recommended"
## [37] "since" "wonder" "centers"
## [40] "control" "disease" "multiple"
## [43] "prevention" "shown" "states"
## [46] "website" "added" "along"
## [49] "already" "cause" "chronic"
## [52] "done" "effects" "emphasis"
## [55] "every" "look" "new"
## [58] "particular" "problems" "several"
## [61] "studies" "clinical" "come"
## [64] "information" "parent" "reading"
## [67] "least" "saying" "anyone"
## [70] "apparently" "conclusion" "statements"
## [73] "alone" "certainly" "current"
## [76] "immunization" "post" "problem"
## [79] "statement" "author" "claim"
## [82] "don’t" "example" "false"
## [85] "longterm" "medical" "medicine"
## [88] "report" "stating" "support"
## [91] "acknowledged" "article" "board"
## [94] "letter" "review" "aware"
## [97] "made" "day" "despite"
## [100] "issue" "still" "truth"
## [103] "become" "believe" "choose"
## [106] "rather" "simply" "stated"
## [109] "continue" "demonstrated" "lying"
## [112] "reality" "actually" "see"
## [115] "’ll" "coming" "lead"
## [118] "making" "much" "people"
## [121] "studied" "way" "well"
## [124] "without" "doesn’t" "means"
## [127] "really" "study" "mean"
## [130] "becoming" "general" "like"
## [133] "make" "concerns" "long"
## [136] "rates" "following" "page"
## [139] "child’s" "immune" "system"
## [142] "baby" "born" "doctor"
## [145] "ever" "exposed" "germs"
## [148] "handle" "know" "life"
## [151] "part" "put" "thousands"
## [154] "water" "infants" "paper"
## [157] "published" "certain" "offit"
## [160] "paul" "dollars" "including"
## [163] "later" "merck" "million"
## [166] "paid" "pharmaceutical" "share"
## [169] "times" "antigens" "contains"
## [172] "far" "hepatitis" "hepb"
## [175] "parents’" "written" "antigen"
## [178] "body" "causes" "causing"
## [181] "response" "true" "used"
## [184] "virus" "however" "completely"
## [187] "ingredients" "otherwise" "although"
## [190] "aluminum" "considered" "drug"
## [193] "fact" "fda" "food"
## [196] "increase" "increased" "known"
## [199] "level" "neurotoxin" "required"
## [202] "contained" "smallpox" "years"
## [205] "able" "adjuvant" "adjuvants"
## [208] "blood" "previous" "use"
## [211] "great" "question" "read"
## [214] "safe" "write" "wrote"
## [217] "aren’t" "course" "wrong"
## [220] "answer" "good" "math"
## [223] "government" "exposure" "yet"
## [226] "upon" "now" "influenza"
## [229] "numerous" "harm" "high"
## [232] "want" "willing" "autism"
## [235] "countries" "longer" "needs"
## [238] "never" "spectrum" "rate"
## [241] "said" "allow" "better"
## [244] "dose" "first" "full"
## [247] "got" "healthy" "journal"
## [250] "need" "perhaps" "please"
## [253] "point" "powerful" "reason"
## [256] "research" "risk" "take"
## [259] "thing" "reaction" "trigger"
## [262] "evidence" "whole" "common"
## [265] "cells" "either" "important"
## [268] "call" "childs" "often"
## [271] "show" "trying" "understand"
## [274] "try" "ive" "comment"
## [277] "different" "natural" "something"
## [280] "toxins" "travel" "others"
## [283] "hard" "think" "wisdom"
## [286] "anything" "basic" "clearly"
## [289] "follow" "single" "act"
## [292] "eyes" "help" "individuals"
## [295] "integrity" "intellectual" "knowledge"
## [298] "science" "scientific" "small"
## [301] "society" "ways" "work"
## [304] "lack" "let" "probably"
## [307] "real" "someone" "speaking"
## [310] "subject" "two" "clear"
## [313] "drugs" "everyone" "factors"
## [316] "negative" "news" "positive"
## [319] "related" "reported" "showed"
## [322] "therefore" "type" "year"
## [325] "adverse" "another" "based"
## [328] "case" "cases" "conditions"
## [331] "cover" "document" "efficacy"
## [334] "entire" "especially" "extremely"
## [337] "low" "med" "heart"
## [340] "power" "right" "thought"
## [343] "book" "associated" "doctors"
## [346] "patients" "state" "diet"
## [349] "family" "monitor" "showing"
## [352] "using" "view" "mandatory"
## [355] "data" "give" "major"
## [358] "personally" "supporting" "proven"
## [361] "treatment" "antiviral" "serious"
## [364] "autoimmune" "disorders" "dna"
## [367] "human" "scientists" "cell"
## [370] "development" "autoimmunity" "brain"
## [373] "form" "neurological" "potential"
## [376] "significant" "young" "factor"
## [379] "primary" "reports" "due"
## [382] "dysfunction" "pain" "within"
## [385] "aluminium" "damage" "cancer"
## [388] "working" "found" "link"
## [391] "association" "findings" "larger"
## [394] "acute" "measles" "mmr"
## [397] "mumps" "connection" "caused"
## [400] "fraud" "millions" "worse"
## [403] "weeks" "care" "nutrition"
## [406] "disseminated" "encephalomyelitis" "population"
## [409] "flood" "stand" "best"
## [412] "feel" "lives" "enough"
## [415] "women" "american" "families"
## [418] "fetal" "flu" "going"
## [421] "must" "pregnant" "week"
## [424] "boys" "thompson" "regression"
## [427] "unvaccine" "world" "green"
## [430] "immunity" "injured" "ultrasound"
## [433] "appears" "diagnosis" "free"
## [436] "gut" "back" "illness"
## [439] "leaks" "months" "side"
## [442] "talking" "happening" "keep"
## [445] "mercury" "month" "ago"
## [448] "away" "love" "started"
## [451] "corrupt" "mothers" "pharma"
## [454] "protect" "together" "brogan"
## [457] "law" "profits" "tell"
## [460] "thanks" "hope" "allowed"
## [463] "bad" "death" "deaths"
## [466] "died" "else" "evil"
## [469] "removed" "result" "sure"
## [472] "lot" "die" "live"
## [475] "fight" "money" "son"
## [478] "took" "daughter" "isn’t"
## [481] "physical" "find" "ones"
## [484] "name" "talk" "effect"
## [487] "rest" "kids" "trust"
## [490] "big" "etc" "root"
## [493] "always" "mother" "held"
## [496] "changed" "companies" "compulsory"
## [499] "delay" "possible" "standing"
## [502] "stop" "story" "vaccinate"
## [505] "’re" "kid" "knew"
## [508] "shot" "pay" "happen"
## [511] "though" "early" "jab"
## [514] "pregnancy" "hospital" "office"
## [517] "regarding" "normal" "things"
## [520] "injury" "everything" "symptoms"
## [523] "old" "community" "’ve"
## [526] "claims" "mark"
The lowfreq argument specifies the minimum number of times a word must occur to be included in this list. If we want to be more selective and pick only the words mentioned at least 20 times, we simply change the argument in the following manner.
findFreqTerms(dtm, lowfreq = 20)
## [1] "child" "given" "health" "many" "parents"
## [6] "public" "time" "safety" "cdc" "disease"
## [11] "cause" "effects" "every" "look" "studies"
## [16] "information" "don’t" "medical" "article" "truth"
## [21] "believe" "see" "much" "people" "well"
## [26] "study" "like" "make" "immune" "system"
## [31] "know" "aluminum" "years" "read" "government"
## [36] "now" "autism" "never" "need" "research"
## [41] "risk" "think" "adverse" "doctors" "mmr"
## [46] "big"
Let’s create a plot of these terms.
termFrequency <- rowSums(as.matrix(dtm))
termFrequency <- subset(termFrequency, termFrequency >= 20)
barplot(termFrequency, las = 2, col = rainbow(20))
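As a small optional tweak (not part of the original plot), sorting the terms by frequency first makes the bars easier to compare:

#optional: sort terms by frequency so the most common words come first
barplot(sort(termFrequency, decreasing = TRUE), las = 2, col = rainbow(20))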
On second thoughts, a bar graph might not be the best way to visualize this data. The words ‘child’, ‘cdc’ and ‘autism’ appear to be the most common. Let’s make a wordcloud to visualize this information more effectively.
A wordcloud is a visual representation of text data where the size or the color of words represents their frequency of occurrence/importance.
m <- as.matrix(dtm)
#sort words by descending frequency
wordFreq <- sort(rowSums(m), decreasing = TRUE)
#set a shade of grey based on frequency
set.seed(123)
grayLevels <- gray( (wordFreq+10)/(max(wordFreq) + 10) )
#wordcloud
wordcloud( words = names(wordFreq), freq=wordFreq,
min.freq = 5, random.order = F, colors = grayLevels)
You can change the number of words in the cloud by changing the max.words argument.
wordcloud( words = names(wordFreq), freq=wordFreq,
max.words = 6, min.freq = 10, random.order = F,
colors = grayLevels)
You can add colors by setting the colors argument.
wordcloud( words = names(wordFreq), freq = wordFreq, min.freq = 10,
random.order = F, colors = wes_palette(name = "Zissou1") )
It looks like the words ‘cdc’, ‘child’ and ‘autism’ are the most commonly occurring topics of discussion in the anti-vaxxer blogs. It is interesting to see the frequent references to ‘studies’, ‘autoimmune’, ‘research’ and ‘antigen’, indicating that the anti-vaxxer audience is at least semi-conversant with scientific terms and concepts, and might be using this knowledge to further their own biases. The word ‘aluminium’ pops up frequently, and is often used in a negative sense to highlight the dangers of injecting foreign substances into the body. The use of words like ‘truth’, ‘lie’, ‘believe’ and ‘pharma’ provides a vague indication of the extent of distrust in the community. Is there a way of quantifying these feelings?
In sentiment analysis with the NRC lexicon, each word is evaluated on ten indicators: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) and two sentiments (negative and positive). The scores for all the words in the document are summed under each indicator to create overall sentiment scores. Let us apply this to the text from the anti-vaxx blogs. We need the ‘syuzhet’ library for this:
library("syuzhet")
s <- get_nrc_sentiment(cleanset$content)
head(s)
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 2 3 2 2 2 3 1 2 3
## 2 0 2 0 0 2 0 1 0 1
## 3 0 0 0 0 1 0 0 0 0
## 4 1 1 1 1 2 1 0 0 1
## 5 1 2 1 2 1 2 0 2 1
## 6 0 2 0 0 1 1 0 1 1
## positive
## 1 4
## 2 2
## 3 1
## 4 3
## 5 3
## 6 4
The result is a description of the score for each of the 416 lines, broken down across the 10 indicators. Let’s take a look at how this plays out:
cleanset$content[2]
## [1] "among criminally irresponsible lies parents told safety child receiving today’s regimen scientifically demonstrated— babies theoretically receive safely "
s[2,]
## anger anticipation disgust fear joy sadness surprise trust negative
## 2 0 2 0 0 2 0 1 0 1
## positive
## 2 2
Sentence 2 got a score of 1 under the ‘negative’ indicator. Where did the 1 come from?
get_nrc_sentiment('irresponsible')
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 0 0 0 0 0 0 0 1
## positive
## 1 0
The word ‘irresponsible’ has a score of 1 under the ‘negative’ indicator, and it is this 1 that we are seeing in the sentiment scores for the second sentence.
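Before checking individual words by hand, an optional shortcut is to score each word of the sentence separately and keep the ones with a non-zero total (a quick sketch, assuming whitespace-separated words):

#optional sketch: score every word of sentence 2 on its own
words2 <- strsplit(cleanset$content[2], "\\s+")[[1]]
scores2 <- get_nrc_sentiment(words2)
words2[rowSums(scores2) > 0] #words that register on at least one indicator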
Take a look at the sentence, though. You can tell that the person who wrote it is angry and distrustful (they mention criminally irresponsible lies). Why, then, did the sentence receive a score of 2 under the positive indicator?
get_nrc_sentiment('child')
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 1 0 0 1 0 0 0 0
## positive
## 1 1
get_nrc_sentiment('receiving')
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 1 0 0 1 0 1 0 0
## positive
## 1 1
The words child and receiving were given a score of one each. That is why the sentence received a score of 2 under the positive indicator. I was surprised to see that the word ‘criminally’ did not contribute to the negative indicator at all.
get_nrc_sentiment('criminally')
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 0 0 0 0 0 0 0 0
## positive
## 1 0
So, sentiment analysis is not always a perfect solution, because a computer program can only read so much human emotion into words. Anyway, let us visualize it to see if we get an overall picture.
barplot(colSums(s), las = 2, col = wes_palette(name = "GrandBudapest1"),
ylab = "Count", main = "Sentiment Scores for Anti-Vaxxer Blogs")
We see a good deal of anger and fear. There is a strong negative sentiment, but it is overshadowed by the positive sentiment. The high positive score is largely explained by the repeated occurrences of the word ‘child’ throughout the document, which the computer interprets as having strong positive connotations. The high ranking for trust could be reflecting instances where people said that they lack trust; this is a disadvantage of sentiment analysis, since individual words are interpreted without any regard to the context in which they were used.
The repeated occurrences of ‘child’ and ‘autism’ serve as a testament to the lasting damage done by Wakefield’s paper in the Lancet. The frequent use of the words ‘cdc’, ‘truth’ and ‘research’ indicates a lack of trust. This was a quick and dirty analysis, intended to serve as a personal record of how to mine text for information. It was interesting to gain a glimpse into these blogs and document the feelings and thoughts that people have. Public health practitioners, especially those within the CDC, should pay attention to how frequently their organization is mentioned in these blogs, and work on innovative ways to build strong relationships with anti-vaxxers, with trust as the key element.
I am indebted to Bharatendra Rai for his instructional videos covering this topic.