Since their inception, vaccines have been a point of debate among individuals, as well as at the center of political decision making. Vaccines are designed to prevent illness and death caused by certain bacterial and viral pathogens. Despite high levels of success directly attributed to their use on a mass scale, such as the eradication of smallpox (Strassburg, 1982) and polio (CDC.gov) in the United States, there have been concerns and push-back among individuals throughout history related to potential side effects and consequences of use. This push-back, dubbed the "Anti-Vax Movement," has become even more prominent in recent years (Hussain et al., 2018), which has raised concern about the potential return of previously eradicated diseases as certain individuals turn away from vaccines en masse.
Reddit is a forum-based social media platform that allows users to create forums, also known as "subreddits," based on specific topics and to have discussions with other users within them. This study focuses on the r/VaccineMyths subreddit, which is dedicated to dispelling myths surrounding vaccination. Using data previously extracted from the subreddit, we will explore the following questions through sentiment analysis:
1. What is the overall sentiment expressed in r/VaccineMyths posts and comments?
2. How has the volume of posts on the subreddit changed over time?
The data set used in this study consists of posts previously extracted from the r/VaccineMyths subreddit and obtained through Kaggle, an open data platform.
Library Loading
The following libraries support the sentiment analysis and the visualization of its results.
library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
install.packages("vader")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("wordcloud2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(vader)
library(wordcloud2)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggplot2)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dpylr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
## Warning: package 'dpylr' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
library(readr)
library(dplyr)
Data Loading
reddit_vm <- read_csv("../project/reddit_vm.csv")
## Rows: 1602 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, id, url, body
## dbl (3): score, comms_num, created
## dttm (1): timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(reddit_vm)
## # A tibble: 6 × 8
## title score id url comms…¹ created body timestamp
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dttm>
## 1 Health Canada app… 7 lt74… http… 0 1.61e9 <NA> 2021-02-27 06:33:45
## 2 COVID-19 in Canad… 2 lsh0… http… 1 1.61e9 <NA> 2021-02-26 07:11:07
## 3 Coronavirus varia… 6 lohl… http… 0 1.61e9 <NA> 2021-02-21 07:50:08
## 4 Canadian governme… 1 lnpt… http… 0 1.61e9 <NA> 2021-02-20 06:35:13
## 5 Canada: Pfizer is… 6 lksl… http… 0 1.61e9 <NA> 2021-02-16 11:36:28
## 6 Canada: Oxford-As… 5 lftb… http… 0 1.61e9 <NA> 2021-02-09 13:17:11
## # … with abbreviated variable name ¹comms_num
This data set contains 1,602 instances (rows) across 8 features (columns). The features are the following:
title: title of the post
score: score of the post based on the number of upvotes vs. downvotes
id: unique id for each post or comment
url: URL of the post thread
comms_num: number of comments on the post
created: date of post creation
body: post or comment text
timestamp: time of post or comment creation
Data Preprocessing and Exploration
Some of the features of this data set are extraneous for this task, so they will be removed. The "created" and "timestamp" columns contain the same information, so only one of them needs to be kept; because "timestamp" stores the time in a more readable format than "created," it will be retained. In addition, "id" and "url" are extraneous for this particular sentiment analysis, so they will be removed as well.
reddit_vm <- reddit_vm %>%
select(title,
score,
comms_num,
body,
timestamp)
reddit_vm
## # A tibble: 1,602 × 5
## title score comms…¹ body timestamp
## <chr> <dbl> <dbl> <chr> <dttm>
## 1 Health Canada approves AstraZeneca C… 7 0 <NA> 2021-02-27 06:33:45
## 2 COVID-19 in Canada: 'Vaccination pas… 2 1 <NA> 2021-02-26 07:11:07
## 3 Coronavirus variants could fuel Cana… 6 0 <NA> 2021-02-21 07:50:08
## 4 Canadian government to extend COVID-… 1 0 <NA> 2021-02-20 06:35:13
## 5 Canada: Pfizer is 'extremely committ… 6 0 <NA> 2021-02-16 11:36:28
## 6 Canada: Oxford-AstraZeneca vaccine a… 5 0 <NA> 2021-02-09 13:17:11
## 7 Comment 1 0 "You… 2019-03-25 02:34:53
## 8 Fuck you anti-vaxxing retards 10 8 "htt… 2020-04-23 20:23:42
## 9 Comment 0 0 "Bec… 2020-04-24 23:19:50
## 10 Comment 0 0 "Wha… 2019-03-25 02:45:21
## # … with 1,592 more rows, and abbreviated variable name ¹comms_num
DISCLAIMER: Some of the posts contain foul language, but because several of these posts have a higher impact than others, they have not been filtered out.
There are now 1,602 instances and 5 columns. Upon inspection of the new data frame, there appear to be missing values, particularly in the "body" column. These missing values, however, still seem to matter for the "score" column, because the corresponding posts link to external URLs that were large points of discussion, so they will stay for now.
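As a quick sanity check (a minimal sketch; the exact counts depend on the data), the number of rows with and without body text can be tallied directly:
# Count how many rows are missing body text vs. how many have it
reddit_vm %>%
  summarise(missing_body = sum(is.na(body)),
            with_body = sum(!is.na(body)))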
Next, the number of comments vs. original posts will be inspected. This is to gauge some of the differences between the two.
filter_vm <- reddit_vm %>%
filter(title == "Comment")
filter_vm
## # A tibble: 1,133 × 5
## title score comms_num body timestamp
## <chr> <dbl> <dbl> <chr> <dttm>
## 1 Comment 1 0 "Your OP. It's not a myth. Only … 2019-03-25 02:34:53
## 2 Comment 0 0 "Because Anti-Vaxxers have no se… 2020-04-24 23:19:50
## 3 Comment 0 0 "What do you mean by \"your OP\"… 2019-03-25 02:45:21
## 4 Comment 1 0 "When they say there's no thimer… 2019-03-25 02:35:47
## 5 Comment 2 0 "The \"myth\" you're debunking i… 2019-03-25 05:50:20
## 6 Comment 2 0 "You'll have to read it again be… 2019-03-25 05:40:03
## 7 Comment 3 0 "Nope. I didn't say anything abo… 2019-03-25 05:54:10
## 8 Comment 1 0 "I didn't say thimerosal is merc… 2019-03-25 05:50:41
## 9 Comment 1 0 "Doctors recommend vaccines for … 2019-03-29 16:36:02
## 10 Comment 1 0 "I'm saying that even if you liv… 2019-03-29 16:15:38
## # … with 1,123 more rows
Filtering on the title "Comment" (the placeholder title given to comments, which makes them easy to identify), there are 1,133 comments. Subtracting 1,133 from the original 1,602 instances leaves 469 original posts.
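The same split can also be computed directly rather than by subtraction (a minimal sketch):
# Tally comments vs. original posts
reddit_vm %>%
  mutate(post_type = if_else(title == "Comment", "comment", "original post")) %>%
  count(post_type)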
Next, we will look at the top 30 posts based on their scores. On Reddit, the higher a post's score, the more influence or impact it generally has.
filter_vm <- reddit_vm %>%
filter(rank(desc(score))<=30)
arrange(reddit_vm, desc(score))
## # A tibble: 1,602 × 5
## title score comms…¹ body timestamp
## <chr> <dbl> <dbl> <chr> <dttm>
## 1 I would rage if this was handed to m… 1187 595 <NA> 2014-04-02 05:32:42
## 2 From /r/Rage 45 13 <NA> 2014-04-02 23:01:49
## 3 Vaccines exposed 38 4 <NA> 2020-12-18 05:11:12
## 4 Do not give a platform for anti-vaxx… 32 5 "I a… 2019-02-12 01:04:08
## 5 Meet my friend's anti-vax wife 32 1 <NA> 2014-04-29 22:47:41
## 6 Vaccines have a huge side effect 30 13 "Vac… 2019-06-08 11:16:35
## 7 How ironic 28 3 <NA> 2020-03-06 11:19:39
## 8 Oh no! I got vaccinated! 28 6 "\n\… 2018-11-21 20:35:02
## 9 Vaccinate folks. 27 7 <NA> 2020-01-09 03:25:37
## 10 This is one of the best explanations… 26 15 "&#x… 2019-08-05 02:13:47
## # … with 1,592 more rows, and abbreviated variable name ¹comms_num
filter_vm
## # A tibble: 28 × 5
## title score comms…¹ body timestamp
## <chr> <dbl> <dbl> <chr> <dttm>
## 1 tElEkInEtIc WaVeS 21 5 <NA> 2021-04-05 07:49:39
## 2 If someone tells you the vaccine con… 20 12 "For… 2021-01-29 19:05:52
## 3 Vaccines exposed 38 4 <NA> 2020-12-18 05:11:12
## 4 Delicious 22 1 <NA> 2020-12-08 02:35:39
## 5 How ironic 28 3 <NA> 2020-03-06 11:19:39
## 6 Hmm 20 1 <NA> 2020-02-10 22:24:38
## 7 Vaccinate folks. 27 7 <NA> 2020-01-09 03:25:37
## 8 Anti vax billboard in my town!!! Help 23 18 <NA> 2019-12-24 02:37:23
## 9 Brave Baby Yoda. 21 5 <NA> 2019-12-06 23:56:58
## 10 This is one of the best explanations… 26 15 "&#x… 2019-08-05 02:13:47
## # … with 18 more rows, and abbreviated variable name ¹comms_num
The top post has a very high score of 1,187 with a total of 595 comments. Upon inspection, it does not have a body, but it appears to be a repost that has clearly gotten a lot of attention. Looking at the other top posts, there is a good mix of opinions, but many of them are seemingly opposed to vaccines, which is interesting.
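As a side note, filtering on rank() can return slightly more or fewer than 30 rows when scores are tied (filter_vm above holds 28 rows); a more direct alternative (a sketch, not the code used above) is dplyr's slice_max(), which returns exactly 30 rows sorted from highest to lowest score:
# Top 30 posts by score; with_ties = FALSE forces exactly 30 rows
top30_vm <- reddit_vm %>%
  slice_max(order_by = score, n = 30, with_ties = FALSE)
top30_vm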
reddit_vm$timestamp <- as.character(reddit_vm$timestamp, format = "%Y-%m")
reddit_vm <- separate(reddit_vm, col = timestamp, into = c("Year","Month"), sep = "-")
reddit_vm
## # A tibble: 1,602 × 6
## title score comms…¹ body Year Month
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Health Canada approves AstraZeneca COVID-19 … 7 0 <NA> 2021 02
## 2 COVID-19 in Canada: 'Vaccination passports' … 2 1 <NA> 2021 02
## 3 Coronavirus variants could fuel Canada's thi… 6 0 <NA> 2021 02
## 4 Canadian government to extend COVID-19 emerg… 1 0 <NA> 2021 02
## 5 Canada: Pfizer is 'extremely committed' to m… 6 0 <NA> 2021 02
## 6 Canada: Oxford-AstraZeneca vaccine approval … 5 0 <NA> 2021 02
## 7 Comment 1 0 "You… 2019 03
## 8 Fuck you anti-vaxxing retards 10 8 "htt… 2020 04
## 9 Comment 0 0 "Bec… 2020 04
## 10 Comment 0 0 "Wha… 2019 03
## # … with 1,592 more rows, and abbreviated variable name ¹comms_num
The timestamp column has now been separated into Year and Month, giving 1,602 instances and 6 columns. I now want to create a visualization to see which years had the highest number of posts.
ggplot(data = reddit_vm) +
  geom_bar(mapping = aes(x = Year, fill = Year))
The bar chart shows that there was a surge in posts in 2019. Based on this, it can be inferred that these discussions increased around the emergence of the COVID-19 pandemic.
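The counts behind the bar chart can also be tabulated directly (a minimal sketch):
# Number of posts and comments per year
reddit_vm %>%
  count(Year)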
Text Preprocessing
The first step in text preprocessing is breaking the text down into unigrams, which are easier to work with for analysis tasks such as word counts and word clouds.
reddit_vm <- reddit_vm %>%
unnest_tokens(output = word,
input = title) %>%
unnest_tokens(output = word,
input = body)
By tokenizing the title and body text into unigram rows, the data set has gone from 1,602 instances to 219,232 rows.
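One side effect of chaining two unnest_tokens() calls on the same output column is that the title tokens are overwritten and, for rows that have both a title and a body, the body is tokenized once per title token. A possible alternative (a sketch, not the approach used in the rest of this analysis) is to combine the title and body into a single text column and tokenize it once:
# Alternative: unite title and body, then tokenize the combined text column
reddit_vm_alt <- read_csv("../project/reddit_vm.csv") %>%
  select(title, score, comms_num, body, timestamp) %>%
  unite(col = "text", title, body, sep = " ", na.rm = TRUE) %>%
  unnest_tokens(output = word, input = text)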
Next, stop words, which do not contribute much to sentiment, are removed. Any missing values carried over from earlier in the data are also dropped.
redditvm_tokens2<- anti_join(reddit_vm,
stop_words,
by = "word") %>%
drop_na()
redditvm_tokens2
## # A tibble: 97,994 × 5
## score comms_num Year Month word
## <dbl> <dbl> <chr> <chr> <chr>
## 1 1 0 2019 03 op
## 2 1 0 2019 03 myth
## 3 1 0 2019 03 vaccine
## 4 1 0 2019 03 op
## 5 1 0 2019 03 pointless
## 6 1 0 2019 03 flex
## 7 10 8 2020 04 https
## 8 10 8 2020 04 youtu.be
## 9 10 8 2020 04 zbkvcpbnnku
## 10 10 8 2020 04 https
## # … with 97,984 more rows
The data has dropped to 97,994 instances. However, since some unnecessary words, phrases, and characters were still showing up prominently, I then created a custom stop word list.
my_stopwords <- c("webp", "=", "+", "x200b", " https", "x1zr9qxkeie31", "png", " 597", "c1cf44bdac7cb6709f564e335607e29445f7ee70", "39k74geueie31", "9da0ea22b7761f94bda08626345d208b0cd3c272", "tyao7ypveie31", "preview.redd.it ","16ba5eecad3b33c5ea0a4e92a7774272c46d31cc", "1p7sf4qweie31", "e3d8874c19c22d7bf670219bb39001915c0cc3d4", "553", "586", "597", "593", "preview.redd.it", "auto", "https", "width", "557", "format", "it's", "zbkvcpbnnku", "youtu.be", "doi", "abs", "10.1146", "1", "el", "la", "de", "3", "en", "http")
redditvm_tokens3 <-
redditvm_tokens2 %>%
filter(!word %in% my_stopwords)
redditvm_tokens3
## # A tibble: 92,580 × 5
## score comms_num Year Month word
## <dbl> <dbl> <chr> <chr> <chr>
## 1 1 0 2019 03 op
## 2 1 0 2019 03 myth
## 3 1 0 2019 03 vaccine
## 4 1 0 2019 03 op
## 5 1 0 2019 03 pointless
## 6 1 0 2019 03 flex
## 7 0 0 2020 04 anti
## 8 0 0 2020 04 vaxxers
## 9 0 0 2020 04 sense
## 10 0 0 2019 03 op
## # … with 92,570 more rows
This further condenses the data to 92,580 rows and 5 columns. I will then save the cleaned Reddit posts as a new data frame called tidy_redditvm.
tidy_redditvm <-redditvm_tokens3
For the first analysis task, I will look at the top tokens and then create a word cloud for easier interpretation.
redditvm_top_tokens <- tidy_redditvm %>%
count(word, sort = TRUE) %>%
top_n(50)
## Selecting by n
redditvm_top_tokens
## # A tibble: 50 × 2
## word n
## <chr> <int>
## 1 vaccine 2365
## 2 vaccines 1442
## 3 autism 814
## 4 children 684
## 5 people 664
## 6 vaccinated 622
## 7 vaccination 489
## 8 immunity 417
## 9 study 384
## 10 measles 381
## # … with 40 more rows
wordcloud2(redditvm_top_tokens)
Here, the words vaccine, autism, immunity, people, vaccination, measles, study, and anti are among the most used. This is very interesting, yet not surprising considering the context.
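As a complement to the word cloud, the same counts can be shown as a bar chart (a sketch that reuses the redditvm_top_tokens data frame created above):
# Bar chart of the 15 most frequent tokens
redditvm_top_tokens %>%
  slice_max(order_by = n, n = 15) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count", title = "Most frequent tokens in r/VaccineMyths")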
I then decided to load the AFINN, NRC, and Bing lexicons to get a refresher on the sentiments and words associated with each before moving on to VADER.
library(tidytext)
install.packages("textdata")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
afinn <- get_sentiments("afinn")
afinn
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,862 more rows
bing <- get_sentiments("bing")
bing
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
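Although the main classification below relies on VADER, these lexicons could also be joined against the tokenized data; for example, a quick Bing-based tally (a sketch, not part of the main analysis) would look like this:
# Count how many tokens the Bing lexicon labels positive vs. negative
tidy_redditvm %>%
  inner_join(bing, by = "word") %>%
  count(sentiment, sort = TRUE)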
Next, I run VADER on a random sample of 500 rows from the original data set, dropping all rows with missing body text to get the best possible calculation.
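To illustrate what VADER produces before scoring the whole sample, get_vader() can be called on a single string (a sketch using a made-up sentence):
# Returns word-level scores plus compound, pos, neu, neg, and but_count
get_vader("Vaccines are safe and they save lives.")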
library(here)
## here() starts at /cloud/project
redditvm_sample <- read_csv("../project/reddit_vm.csv") %>%
sample_n(500)
## Rows: 1602 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, id, url, body
## dbl (3): score, comms_num, created
## dttm (1): timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
redditvm_sample
## # A tibble: 500 × 8
## title score id url comms…¹ created body timestamp
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dttm>
## 1 CHAD ends debate 0 phjx… http… 0 1.63e9 <NA> 2021-09-04 06:23:46
## 2 Comment -5 etdq… <NA> 0 1.56e9 "1) … 2019-07-10 01:57:50
## 3 Comment 2 gmbx… <NA> 0 1.61e9 "Wel… 2021-02-07 05:28:17
## 4 Comment 0 elrt… <NA> 0 1.56e9 "Is … 2019-04-25 22:53:50
## 5 Comment 2 gs39… <NA> 0 1.62e9 "Und… 2021-03-24 22:20:36
## 6 Comment 1 elhy… <NA> 0 1.56e9 "Wak… 2019-04-23 02:07:55
## 7 Comment 2 f71v… <NA> 0 1.57e9 "Don… 2019-11-10 03:22:14
## 8 If the evidence … 5 aalf… http… 1 1.55e9 <NA> 2018-12-29 16:03:30
## 9 Comment -2 fgsj… <NA> 0 1.58e9 "Wel… 2020-02-07 16:28:07
## 10 Comment 1 elru… <NA> 0 1.56e9 "If … 2019-04-25 23:02:04
## # … with 490 more rows, and abbreviated variable name ¹comms_num
vader_redditvm <- vader_df(redditvm_sample$body)
vader_redditvm <- vader_redditvm %>%
drop_na()
mean(vader_redditvm$compound)
## [1] 0.01676064
Overall, with an average compound score of roughly 0.017 in the sample of 500, the sentiments appear to sit close to neutral, with a slight positive lean. To get a clearer count for each sentiment, I will categorize each row as positive, negative, or neutral using the standard compound thresholds of ±0.05.
vader_redditvm_summary <- vader_redditvm %>%
mutate(sentiment = ifelse(compound >= 0.05, "positive",
ifelse(compound <= -0.05, "negative", "neutral"))) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n) %>%
relocate(positive) %>%
mutate(ratio = negative/positive)
vader_redditvm_summary
## positive negative neutral ratio
## 1 143 139 94 0.972028
Here, based on the classification, positive rows (143) slightly outnumber negative ones (139), with a substantial neutral group (94). This is in line with the initial compound average, which sits close to neutral with a slight positive lean.
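For additional context, the same classification can be expressed as proportions of the sample (a sketch; the exact shares will vary because the 500 rows are drawn randomly):
# Share of each sentiment class among the scored rows
vader_redditvm %>%
  mutate(sentiment = case_when(compound >= 0.05 ~ "positive",
                               compound <= -0.05 ~ "negative",
                               TRUE ~ "neutral")) %>%
  count(sentiment) %>%
  mutate(proportion = n / sum(n))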
Now, I want to visualize my results to get another picture of the overall sentiment values from the data sample.
install.packages("plotrix")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(plotrix)
slices <- c(vader_redditvm_summary$negative,
            vader_redditvm_summary$positive,
            vader_redditvm_summary$neutral)
lbls <- paste0(c("Negative (", "Positive (", "Neutral ("), slices, ")")
pie(slices, labels = lbls,
    main = "Reddit Vaccine Myths Sentiments", radius = 1, col = c("Pink", "Purple", "Yellow"))
The key questions of this study were:
1. What is the overall sentiment expressed in r/VaccineMyths posts and comments?
2. How has the volume of posts on the subreddit changed over time?
The findings show that the overall sentiment toward vaccines was positive, which is in line with the intended purpose of the subreddit: to discuss and dispel myths surrounding vaccines. Despite this, negative sentiments came very close behind, signifying that there is some back-and-forth discourse between the two sides, despite the goal of the subreddit. One could argue that these debates, especially in the posts with higher engagement, can allow for an exchange of ideas that may lead to more education on the subject.
From the beginning of the subreddit's existence in 2014 to the present, this study showed a surge in posts around 2019, coinciding with the emergence of SARS-CoV-2 (COVID-19). However, since discussions about vaccines were happening before then, the subreddit can also be read as a reflection of opinions on other existing vaccines.
In the future, I am interested in doing further manipulation and wrangling of the text to improve the accuracy of the sentiment scores. Though VADER provided a reasonable way to calculate sentiments, hand inspection of the data revealed instances that were incorrectly labeled as positive, negative, or neutral. For example, one post that consisted mostly of expletives and a negative dig at another individual was labeled as positive, most likely because the statement began with "laughing." It is more difficult for machines to measure intent in the way humans readily do, but I would like to get closer. Doing so would give a clearer picture of how myths are discussed and dispelled on social media channels, and how this could be leveraged on a larger scale. Some of the most-used words concerned government and media, which is telling.
This research is limited in that it may not adequately capture societal sentiments on vaccines and the myths surrounding them, due to the limited sample of Reddit data. In the future, more research could be done using a more neutral subreddit, one that explores sentiments about vaccines in specific contexts such as r/Politics or r/WorldWatch, or a more generalized discussion forum such as r/science, to see what those sentiments look like.
References
Hussain A, Ali S, Ahmed M, Hussain S. The Anti-vaccination Movement: A Regression in Modern Medicine. Cureus. 2018 Jul 3;10(7):e2919. doi: 10.7759/cureus.2919. PMID: 30186724; PMCID: PMC6122668.
Krumm, A., Means, B., & Bienkowski, M. (2018). Learning analytics goes to school: A collaborative approach to improving education.
Strassburg MA. The global eradication of smallpox. Am J Infect Control. 1982 May;10(2):53-9. doi: 10.1016/0196-6553(82)90003-7. PMID: 7044193.