I. Prepare

Since their inception, vaccines have been a point of debate among individuals and a focus of political decision making. Vaccines are designed to prevent illness and death caused by certain bacterial and viral pathogens. Despite the success directly attributed to their mass use, such as the eradication of smallpox (Strassburg, 1982) and the elimination of polio in the United States (CDC.gov), there have long been concerns and pushback among individuals about potential side effects and consequences of use. This pushback, dubbed the “Anti-Vax Movement,” has become even more prominent in recent years (Hussain et al., 2018), raising concern that previously eliminated diseases could return as some individuals turn away from vaccines en masse.

Reddit is a forum-based social media platform that allows users to create forums, also known as ‘subreddits’, based on specific topics and to have discussions with other users within them. This study focuses on r/VaccineMyths, a subreddit dedicated to dispelling myths surrounding vaccinations. Using data previously extracted from the subreddit, we will explore the following questions through sentiment analysis:

  1. What are the sentiments toward vaccines in the r/VaccineMyths subreddit?
  2. Are the sentiments of the r/VaccineMyths subreddit in line with its intended purpose?
  3. Which years saw the most discussion on vaccines and what inferences can be made on why?

II. Wrangle

The data set in this study consists of posts previously extracted from the r/VaccineMyths subreddit and obtained from Kaggle, an online platform that hosts open data sets.

Library Loading:

Each of the following libraries will allow for the sentiment analysis and visualization of sentiments.

library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
install.packages("vader")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("wordcloud2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(vader)
library(wordcloud2)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggplot2)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(readr)
library(dplyr)

Data Loading

reddit_vm <- read_csv("../project/reddit_vm.csv")
## Rows: 1602 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): title, id, url, body
## dbl  (3): score, comms_num, created
## dttm (1): timestamp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(reddit_vm)
## # A tibble: 6 × 8
##   title              score id    url   comms…¹ created body  timestamp          
##   <chr>              <dbl> <chr> <chr>   <dbl>   <dbl> <chr> <dttm>             
## 1 Health Canada app…     7 lt74… http…       0  1.61e9 <NA>  2021-02-27 06:33:45
## 2 COVID-19 in Canad…     2 lsh0… http…       1  1.61e9 <NA>  2021-02-26 07:11:07
## 3 Coronavirus varia…     6 lohl… http…       0  1.61e9 <NA>  2021-02-21 07:50:08
## 4 Canadian governme…     1 lnpt… http…       0  1.61e9 <NA>  2021-02-20 06:35:13
## 5 Canada: Pfizer is…     6 lksl… http…       0  1.61e9 <NA>  2021-02-16 11:36:28
## 6 Canada: Oxford-As…     5 lftb… http…       0  1.61e9 <NA>  2021-02-09 13:17:11
## # … with abbreviated variable name ¹​comms_num

This data set contains 1,602 instances (rows) across 8 features (columns): title, score, id, url, comms_num, created, body, and timestamp.

Data Preprocessing and Exploration

  1. Some of the features of this data set are extraneous given the task, so they will be removed. The “created” and “timestamp” columns hold the same information, so one of them will be removed. Because “timestamp” stores the time in a readable date-time format, unlike the numeric “created” column, it will be kept. In addition, “id” and “url” are extraneous for this particular sentiment analysis, so they will be removed as well.

    reddit_vm <- reddit_vm %>% 
      select(title,
             score,
             comms_num, 
             body,
             timestamp)
    reddit_vm
    ## # A tibble: 1,602 × 5
    ##    title                                 score comms…¹ body  timestamp          
    ##    <chr>                                 <dbl>   <dbl> <chr> <dttm>             
    ##  1 Health Canada approves AstraZeneca C…     7       0  <NA> 2021-02-27 06:33:45
    ##  2 COVID-19 in Canada: 'Vaccination pas…     2       1  <NA> 2021-02-26 07:11:07
    ##  3 Coronavirus variants could fuel Cana…     6       0  <NA> 2021-02-21 07:50:08
    ##  4 Canadian government to extend COVID-…     1       0  <NA> 2021-02-20 06:35:13
    ##  5 Canada: Pfizer is 'extremely committ…     6       0  <NA> 2021-02-16 11:36:28
    ##  6 Canada: Oxford-AstraZeneca vaccine a…     5       0  <NA> 2021-02-09 13:17:11
    ##  7 Comment                                   1       0 "You… 2019-03-25 02:34:53
    ##  8 Fuck you anti-vaxxing retards            10       8 "htt… 2020-04-23 20:23:42
    ##  9 Comment                                   0       0 "Bec… 2020-04-24 23:19:50
    ## 10 Comment                                   0       0 "Wha… 2019-03-25 02:45:21
    ## # … with 1,592 more rows, and abbreviated variable name ¹​comms_num

    DISCLAIMER: Some of the posts contain foul language. Because several of these posts have a higher impact than others, they have not been filtered out.

    There are now 1,602 instances and 5 columns. Upon inspection of the new data frame, there are missing values, particularly in the “body” column. These posts still have an impact on the “score” column, however, because they link to external URLs that were large points of discussion, so they will stay for now.

  2. Next, the number of comments vs. original posts will be inspected. This is to gauge some of the differences between the two.

filter_vm <- reddit_vm %>% 
  filter(title == "Comment") 

filter_vm
## # A tibble: 1,133 × 5
##    title   score comms_num body                              timestamp          
##    <chr>   <dbl>     <dbl> <chr>                             <dttm>             
##  1 Comment     1         0 "Your OP. It's not a myth. Only … 2019-03-25 02:34:53
##  2 Comment     0         0 "Because Anti-Vaxxers have no se… 2020-04-24 23:19:50
##  3 Comment     0         0 "What do you mean by \"your OP\"… 2019-03-25 02:45:21
##  4 Comment     1         0 "When they say there's no thimer… 2019-03-25 02:35:47
##  5 Comment     2         0 "The \"myth\" you're debunking i… 2019-03-25 05:50:20
##  6 Comment     2         0 "You'll have to read it again be… 2019-03-25 05:40:03
##  7 Comment     3         0 "Nope. I didn't say anything abo… 2019-03-25 05:54:10
##  8 Comment     1         0 "I didn't say thimerosal is merc… 2019-03-25 05:50:41
##  9 Comment     1         0 "Doctors recommend vaccines for … 2019-03-29 16:36:02
## 10 Comment     1         0 "I'm saying that even if you liv… 2019-03-29 16:15:38
## # … with 1,123 more rows

Filtering on the title “Comment” is the easiest way to separate comments from posts, since every comment carries that one-word title. There are 1,133 comments; subtracting these from the original 1,602 instances leaves 469 original posts.
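
The same split can also be computed in a single step. The following is a minimal sketch on a small made-up tibble; the `toy` data and the `type` labels are assumptions for illustration and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical stand-in for reddit_vm: comments carry the literal title "Comment"
toy <- tibble(title = c("Comment", "Vaccinate folks.", "Comment",
                        "How ironic", "Comment"))

# Label each row, then count rows per label
post_counts <- toy %>%
  mutate(type = if_else(title == "Comment", "comment", "original post")) %>%
  count(type)

post_counts
```

Applied to the real data, this would report the number of comments and original posts side by side without a manual subtraction.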

Next, we look at the top 30 posts by score. On Reddit, a higher score generally indicates more influence or impact.

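# Keep posts whose score rank is 30 or better. rank() averages tied ranks,
# so the result may not contain exactly 30 rows (28 here).
filter_vm <- reddit_vm %>%
  filter(rank(desc(score)) <= 30)

# Print the full data set sorted by score (the result is not assigned)
arrange(reddit_vm, desc(score))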
## # A tibble: 1,602 × 5
##    title                                 score comms…¹ body  timestamp          
##    <chr>                                 <dbl>   <dbl> <chr> <dttm>             
##  1 I would rage if this was handed to m…  1187     595  <NA> 2014-04-02 05:32:42
##  2 From /r/Rage                             45      13  <NA> 2014-04-02 23:01:49
##  3 Vaccines exposed                         38       4  <NA> 2020-12-18 05:11:12
##  4 Do not give a platform for anti-vaxx…    32       5 "I a… 2019-02-12 01:04:08
##  5 Meet my friend's anti-vax wife           32       1  <NA> 2014-04-29 22:47:41
##  6 Vaccines have a huge side effect         30      13 "Vac… 2019-06-08 11:16:35
##  7 How ironic                               28       3  <NA> 2020-03-06 11:19:39
##  8 Oh no! I got vaccinated!                 28       6 "\n\… 2018-11-21 20:35:02
##  9 Vaccinate folks.                         27       7  <NA> 2020-01-09 03:25:37
## 10 This is one of the best explanations…    26      15 "&#x… 2019-08-05 02:13:47
## # … with 1,592 more rows, and abbreviated variable name ¹​comms_num
filter_vm
## # A tibble: 28 × 5
##    title                                 score comms…¹ body  timestamp          
##    <chr>                                 <dbl>   <dbl> <chr> <dttm>             
##  1 tElEkInEtIc WaVeS                        21       5  <NA> 2021-04-05 07:49:39
##  2 If someone tells you the vaccine con…    20      12 "For… 2021-01-29 19:05:52
##  3 Vaccines exposed                         38       4  <NA> 2020-12-18 05:11:12
##  4 Delicious                                22       1  <NA> 2020-12-08 02:35:39
##  5 How ironic                               28       3  <NA> 2020-03-06 11:19:39
##  6 Hmm                                      20       1  <NA> 2020-02-10 22:24:38
##  7 Vaccinate folks.                         27       7  <NA> 2020-01-09 03:25:37
##  8 Anti vax billboard in my town!!! Help    23      18  <NA> 2019-12-24 02:37:23
##  9 Brave Baby Yoda.                         21       5  <NA> 2019-12-06 23:56:58
## 10 This is one of the best explanations…    26      15 "&#x… 2019-08-05 02:13:47
## # … with 18 more rows, and abbreviated variable name ¹​comms_num

The top post has a very high score of 1,187 with a total of 595 comments. Upon inspection, it does not have a body, but is indicative of a repost that has clearly gotten a lot of attention. Looking at the other top posts, there is a good mix of opinions, but many of them are seemingly opposed to vaccines, which is interesting.
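
The top-30 filter above returned 28 rows because `rank()` averages the ranks of tied scores. As a hedged sketch of one alternative, `slice_max()` with `with_ties = FALSE` returns exactly the requested number of rows; the `toy` scores below are made up for illustration.

```r
library(dplyr)

# Hypothetical scores with a three-way tie at 30
toy <- tibble(title = paste("post", 1:6),
              score = c(50, 40, 30, 30, 30, 10))

# rank() gives the three tied posts the average rank 4, so only two rows pass
by_rank <- toy %>% filter(rank(desc(score)) <= 3)

# slice_max() keeps exactly n rows and returns them sorted by score
by_slice <- toy %>% slice_max(score, n = 3, with_ties = FALSE)

nrow(by_rank)
nrow(by_slice)
```

Whether ties should be dropped, kept, or broken arbitrarily is a design choice; `with_ties = TRUE` (the default) would instead keep every post tied at the cutoff.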

  3. The next task is to observe trends over the years based on the post timestamps to see if any conclusions can be drawn. The timestamp is stored in the POSIXct format, which records the year, month, day, and time of day (hours, minutes, and seconds). For this research, however, I only need the month and year. To extract these into new columns called “Year” and “Month”, I first convert the timestamp to a character string containing only the year and month, and then separate it into the two columns.

# format() returns the year and month of each POSIXct timestamp as text, e.g. "2021-02"
reddit_vm$timestamp <- format(reddit_vm$timestamp, "%Y-%m")

reddit_vm <- separate(reddit_vm, col = timestamp, into = c("Year","Month"), sep = "-")
reddit_vm
## # A tibble: 1,602 × 6
##    title                                         score comms…¹ body  Year  Month
##    <chr>                                         <dbl>   <dbl> <chr> <chr> <chr>
##  1 Health Canada approves AstraZeneca COVID-19 …     7       0  <NA> 2021  02   
##  2 COVID-19 in Canada: 'Vaccination passports' …     2       1  <NA> 2021  02   
##  3 Coronavirus variants could fuel Canada's thi…     6       0  <NA> 2021  02   
##  4 Canadian government to extend COVID-19 emerg…     1       0  <NA> 2021  02   
##  5 Canada: Pfizer is 'extremely committed' to m…     6       0  <NA> 2021  02   
##  6 Canada: Oxford-AstraZeneca vaccine approval …     5       0  <NA> 2021  02   
##  7 Comment                                           1       0 "You… 2019  03   
##  8 Fuck you anti-vaxxing retards                    10       8 "htt… 2020  04   
##  9 Comment                                           0       0 "Bec… 2020  04   
## 10 Comment                                           0       0 "Wha… 2019  03   
## # … with 1,592 more rows, and abbreviated variable name ¹​comms_num
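
As a small illustration of how the year and month can be pulled out of a date-time value, the sketch below applies `format()` to two timestamps copied from the output above. The `ts` and `parts` names and the UTC time zone are assumptions made for the example.

```r
# Two timestamps taken from the printed data, parsed as POSIXct (UTC assumed)
ts <- as.POSIXct(c("2021-02-27 06:33:45", "2019-03-25 02:34:53"), tz = "UTC")

# "%Y" extracts the four-digit year, "%m" the two-digit month, both as text
parts <- data.frame(Year  = format(ts, "%Y"),
                    Month = format(ts, "%m"))

parts
```

This produces the two columns directly, which is an alternative to converting to one string and then calling `separate()`.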

The columns are now separated with 1,602 instances and 6 columns. I now want to create a visualization to see which years had the highest number of posts.

# fill colours the bars themselves; colour would only change their outlines
ggplot(data = reddit_vm) +
  geom_bar(mapping = aes(x = Year, fill = Year))

The bar chart shows a surge in posts in 2019. One possible inference is that these discussions surged because of the COVID-19 pandemic, although this should be treated with caution: COVID-19 was first reported at the very end of 2019 and was declared a pandemic in March 2020.
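
A table of counts can back up what the chart shows and helps check when within a year the activity occurred. The following is a minimal sketch on made-up rows; the `toy` tibble is hypothetical and only mirrors the `Year` and `Month` columns created above.

```r
library(dplyr)

# Hypothetical rows with the same Year/Month character columns as reddit_vm
toy <- tibble(Year  = c("2019", "2019", "2019", "2020", "2021"),
              Month = c("03", "03", "04", "04", "02"))

# Rows per year, busiest year first
per_year <- toy %>% count(Year, sort = TRUE)

# Rows per year and month, useful for seeing when a surge happened
per_month <- toy %>% count(Year, Month)

per_year
per_month
```

On the real data, the monthly breakdown would show whether the 2019 activity was concentrated early or late in the year.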

Text Preprocessing

The first step to text preprocessing is breaking down the text into unigrams that are easier to understand and work with for analysis purposes such as word count and word cloud.

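# Tokenize the title, then the body, into one word per row.
# Note: both calls write to the same `word` column, so the second call
# overwrites the title tokens with body tokens, and each body token is
# repeated once for every word in the post's title.
reddit_vm <- reddit_vm %>%
  unnest_tokens(output = word,
                input = title) %>%
  unnest_tokens(output = word,
                input = body)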

Splitting the text into unigrams stored in a new column takes the data set from 1,602 instances to 219,232.
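
Chaining two `unnest_tokens()` calls that write to the same column pairs every title word with every body word, which is part of why the row count grows so sharply. One possible alternative, shown here only as a sketch on made-up data, is to combine the title and body into one text column before tokenizing. The `toy` tibble and the `text` and `post_id` names are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical posts: a link post without a body and a comment with a body
toy <- tibble(title = c("Vaccines work", "Comment"),
              body  = c(NA, "Measles is back"))

tokens <- toy %>%
  mutate(post_id = row_number()) %>%                      # remember the source post
  unite("text", title, body, sep = " ", na.rm = TRUE) %>% # join title and body, skipping NA
  unnest_tokens(output = word, input = text)              # one lowercase word per row

tokens
```

With this approach each word appears once per occurrence, and title-only posts still contribute their title words.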

Now, in order to remove words that don’t necessarily help show sentiment, the removal of stop words must be coded. I also chose to drop any missing words that may have shown up from earlier in the data.

redditvm_tokens2<- anti_join(reddit_vm,
                         stop_words,
                         by = "word") %>%
  drop_na()

redditvm_tokens2
## # A tibble: 97,994 × 5
##    score comms_num Year  Month word       
##    <dbl>     <dbl> <chr> <chr> <chr>      
##  1     1         0 2019  03    op         
##  2     1         0 2019  03    myth       
##  3     1         0 2019  03    vaccine    
##  4     1         0 2019  03    op         
##  5     1         0 2019  03    pointless  
##  6     1         0 2019  03    flex       
##  7    10         8 2020  04    https      
##  8    10         8 2020  04    youtu.be   
##  9    10         8 2020  04    zbkvcpbnnku
## 10    10         8 2020  04    https      
## # … with 97,984 more rows

The data has dropped to 97,994 instances. However, some unnecessary words, phrases, and characters (such as URL fragments and long alphanumeric strings) still show up prominently, so I create a custom stop word list.

my_stopwords <- c("webp", "=", "+", "x200b", "  https", "x1zr9qxkeie31", "png", "   597", "c1cf44bdac7cb6709f564e335607e29445f7ee70", "39k74geueie31", "9da0ea22b7761f94bda08626345d208b0cd3c272", "tyao7ypveie31", "preview.redd.it    ","16ba5eecad3b33c5ea0a4e92a7774272c46d31cc", "1p7sf4qweie31", "e3d8874c19c22d7bf670219bb39001915c0cc3d4", "553", "586", "597", "593", "preview.redd.it", "auto", "https", "width", "557", "format", "it's", "zbkvcpbnnku", "youtu.be", "doi", "abs", "10.1146", "1", "el", "la", "de", "3", "en", "http")

redditvm_tokens3 <-
  redditvm_tokens2 %>%
  filter(!word %in% my_stopwords)
redditvm_tokens3
## # A tibble: 92,580 × 5
##    score comms_num Year  Month word     
##    <dbl>     <dbl> <chr> <chr> <chr>    
##  1     1         0 2019  03    op       
##  2     1         0 2019  03    myth     
##  3     1         0 2019  03    vaccine  
##  4     1         0 2019  03    op       
##  5     1         0 2019  03    pointless
##  6     1         0 2019  03    flex     
##  7     0         0 2020  04    anti     
##  8     0         0 2020  04    vaxxers  
##  9     0         0 2020  04    sense    
## 10     0         0 2019  03    op       
## # … with 92,570 more rows
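
A hand-written stop word list works, but tokens such as long hexadecimal strings, bare numbers, and URL fragments can also be caught by pattern. The sketch below uses base R regular expressions on a few tokens copied from the list above; the specific patterns are illustrative assumptions and would need checking against the full data.

```r
# A few tokens taken from the custom stop word list, plus two real words
words <- c("vaccine", "c1cf44bdac7cb6709f564e335607e29445f7ee70",
           "597", "10.1146", "measles", "preview.redd.it")

# Flag long hex strings, purely numeric tokens, and domain-like tokens
noise <- grepl("^[0-9a-f]{20,}$", words) |
  grepl("^[0-9.]+$", words) |
  grepl("\\.(it|be|com|org)$", words)

kept <- words[!noise]
kept
```

Pattern-based filtering generalizes to tokens that were not spotted by eye, at the cost of occasionally removing something meaningful, so the result is worth inspecting.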

This further condenses the data to 92,580 rows and 5 columns. I then save the cleaned Reddit posts as a new data frame called tidy_redditvm.

tidy_redditvm <- redditvm_tokens3

III. Analyze

For the first analysis task, I will look at the top tokens and then create a word cloud for easier interpretation.

redditvm_top_tokens <- tidy_redditvm %>%
  count(word, sort = TRUE) %>%
  top_n(50)
## Selecting by n
redditvm_top_tokens
## # A tibble: 50 × 2
##    word            n
##    <chr>       <int>
##  1 vaccine      2365
##  2 vaccines     1442
##  3 autism        814
##  4 children      684
##  5 people        664
##  6 vaccinated    622
##  7 vaccination   489
##  8 immunity      417
##  9 study         384
## 10 measles       381
## # … with 40 more rows
wordcloud2(redditvm_top_tokens)

Here, the words vaccine, autism, children, people, vaccination, immunity, study, measles, and anti are among the most used. This is interesting yet not surprising considering the context.

I then load the AFINN, NRC, and Bing lexicons to get a refresher on the words and sentiments associated with them before moving on to VADER.

library(tidytext)
install.packages("textdata")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
afinn <- get_sentiments("afinn")

afinn
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
nrc <- get_sentiments("nrc")

nrc
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows
bing <- get_sentiments("bing")

bing
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
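
To show how a lexicon like these is typically applied to tokenized text, the sketch below joins a handful of tokens to a tiny lexicon and sums the scores. The `toy_tokens` data is made up, and `toy_lexicon` is a two-word stand-in whose values are copied from the AFINN printout above; it is not the full lexicon.

```r
library(dplyr)

# Hypothetical tokens from a post
toy_tokens <- tibble(word = c("abandon", "vaccine", "abhor", "abandon"))

# Two-word stand-in for AFINN (values as printed above)
toy_lexicon <- tibble(word = c("abandon", "abhor"), value = c(-2, -3))

# Keep only tokens that appear in the lexicon, with their scores attached
scored <- toy_tokens %>% inner_join(toy_lexicon, by = "word")

# Sum the word scores to get an overall score for the text
total <- sum(scored$value)
total
```

Words missing from the lexicon, such as "vaccine" here, simply drop out of the join, which is one reason lexicon coverage matters for domain-specific text.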

To run VADER, I draw a random sample of 500 rows from the original data set and then drop all rows with missing data to get the best possible calculation.

library(here)
## here() starts at /cloud/project
redditvm_sample <- read_csv("../project/reddit_vm.csv") %>%
  sample_n(500)
## Rows: 1602 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): title, id, url, body
## dbl  (3): score, comms_num, created
## dttm (1): timestamp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
redditvm_sample
## # A tibble: 500 × 8
##    title             score id    url   comms…¹ created body  timestamp          
##    <chr>             <dbl> <chr> <chr>   <dbl>   <dbl> <chr> <dttm>             
##  1 CHAD ends debate      0 phjx… http…       0  1.63e9  <NA> 2021-09-04 06:23:46
##  2 Comment              -5 etdq… <NA>        0  1.56e9 "1) … 2019-07-10 01:57:50
##  3 Comment               2 gmbx… <NA>        0  1.61e9 "Wel… 2021-02-07 05:28:17
##  4 Comment               0 elrt… <NA>        0  1.56e9 "Is … 2019-04-25 22:53:50
##  5 Comment               2 gs39… <NA>        0  1.62e9 "Und… 2021-03-24 22:20:36
##  6 Comment               1 elhy… <NA>        0  1.56e9 "Wak… 2019-04-23 02:07:55
##  7 Comment               2 f71v… <NA>        0  1.57e9 "Don… 2019-11-10 03:22:14
##  8 If the evidence …     5 aalf… http…       1  1.55e9  <NA> 2018-12-29 16:03:30
##  9 Comment              -2 fgsj… <NA>        0  1.58e9 "Wel… 2020-02-07 16:28:07
## 10 Comment               1 elru… <NA>        0  1.56e9 "If … 2019-04-25 23:02:04
## # … with 490 more rows, and abbreviated variable name ¹​comms_num
vader_redditvm <- vader_df(redditvm_sample$body)
vader_redditvm <- vader_redditvm %>%
  drop_na()
mean(vader_redditvm$compound)
## [1] 0.01676064
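
Because `sample_n()` draws rows at random, each run of the report scores a different sample and the mean changes slightly. Setting a seed before sampling is a common way to make the draw repeatable. The sketch below shows the idea with base `sample()`; the seed value 2023 is arbitrary and is an assumption of this example.

```r
# Draw five row numbers out of 1,602 twice, resetting the seed each time
set.seed(2023)
first_draw <- sample(1:1602, 5)

set.seed(2023)
second_draw <- sample(1:1602, 5)

# With the same seed, both draws select the same rows
identical(first_draw, second_draw)
```

Calling `set.seed()` immediately before `sample_n(500)` would fix the sample in the same way, so the reported numbers would stay consistent between runs.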

Overall, the mean compound score is about 0.017 for this sample of 500, which sits close to neutral (VADER compound scores range from -1 to 1). Because the sample is drawn at random without a fixed seed, this value changes slightly from run to run. To get a clearer count for each sentiment, I categorize each text as positive, negative, or neutral using the conventional VADER thresholds of 0.05 and -0.05.

vader_redditvm_summary <- vader_redditvm %>% 
  mutate(sentiment = ifelse(compound >= 0.05, "positive",
                            ifelse(compound <= -0.05, "negative", "neutral"))) %>%
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>% 
  relocate(positive) %>%
  mutate(ratio = negative/positive)

vader_redditvm_summary
##   positive negative neutral    ratio
## 1      143      139      94 0.972028
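
To make the threshold rule explicit, the sketch below wraps it in a small helper and applies it to a few made-up compound scores, including one exactly on the boundary. The `classify_compound` name and the example scores are hypothetical.

```r
library(dplyr)

# Map VADER compound scores to labels using the 0.05 / -0.05 thresholds
classify_compound <- function(compound) {
  case_when(
    compound >= 0.05  ~ "positive",
    compound <= -0.05 ~ "negative",
    TRUE              ~ "neutral"
  )
}

# Made-up scores: clearly positive, near zero, on the boundary, clearly negative
labels <- classify_compound(c(0.60, 0.017, -0.05, -0.30))
labels
```

A score exactly at a threshold falls into the positive or negative class rather than neutral, which matches the `>=` and `<=` comparisons used in the summary code.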

Based on this classification, positive texts (143) slightly outnumber negative ones (139), with 94 classified as neutral. This near balance is in line with the mean compound score, which sits close to neutrality.

Now, I want to visualize my results to get another picture of the overall sentiment values from the data sample.

install.packages("plotrix")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(plotrix)
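# Take the counts from the summary table so the chart matches the current sample
slices <- c(vader_redditvm_summary$negative,
            vader_redditvm_summary$positive,
            vader_redditvm_summary$neutral)
lbls <- paste0(c("Negative", "Positive", "Neutral"), " (", slices, ")")
pie(slices, labels = lbls,
    main = "Reddit Vaccine Myths Sentiments", radius = 1,
    col = c("pink", "purple", "yellow"))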
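
Counts are easier to compare when expressed as shares of the total. As a small worked example, the sketch below converts the counts printed in the summary table (143 positive, 139 negative, 94 neutral) into percentages; the `counts` and `shares` names are assumptions for illustration.

```r
# Counts copied from the summary table above
counts <- c(positive = 143, negative = 139, neutral = 94)

# Share of each class in percent, rounded to one decimal place
shares <- round(100 * counts / sum(counts), 1)

sum(counts)
shares
```

The total of 376 also shows how many of the 500 sampled rows remained after rows with missing values were dropped.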

IV. Communicate

The key questions of this study were:

  1. What are the sentiments toward vaccines in the r/VaccineMyths subreddit?
  2. Are the sentiments of the r/VaccineMyths subreddit in line with its intended purpose?
  3. Which years saw the most discussion on vaccines and what inferences can be made on why?

The findings show that the overall sentiment in the subreddit was slightly positive, which is in line with its intended purpose: to discuss and dispel myths surrounding vaccines. Negative sentiments came very close, however, signifying that there does seem to be back-and-forth discourse between the two sides despite the goal of the subreddit. One could argue that these debates, especially in the posts with higher engagement, allow for an exchange of ideas that may lead to more education on the subject.

From the beginning of the subreddit’s existence in 2014 until the present, this study showed a surge in posts around 2019, which roughly coincides with the emergence of SARS-CoV-2 (COVID-19) at the end of that year. However, since discussion of vaccines was happening before then, the subreddit can also be read as a reflection of opinions on other vaccines in existence.

In the future, I am interested in doing further manipulation and wrangling of the text to gain more accuracy in the sentiments. Though VADER gave a good overall estimate of sentiment, hand inspection of the data revealed instances that were incorrectly labeled as positive, negative, or neutral. For example, one post that was mostly expletives and a negative dig at another individual was labeled as positive, most likely because the statement began with “laughing”. It is more difficult for machines to measure intent in the way humans understand it, but I would like to get closer. Doing so would give a clearer picture of what is being said and done to hold discourse and dispel myths on social media channels, and of how this can be leveraged on a larger scale. Some of the most used words surrounded government and media, which is telling.

This particular research is limited in that it may not adequately identify societal sentiments on vaccines and the myths surrounding them due to a limited sample size of Reddit data. In the future, more research may be done using a more neutral subreddit, or one that explores sentiments about vaccines given specific contexts such as r/Politics or r/WorldWatch or more generalized discussion forums such as r/science to see what the sentiments look like.

References

Hussain A, Ali S, Ahmed M, Hussain S. The Anti-vaccination Movement: A Regression in Modern Medicine. Cureus. 2018 Jul 3;10(7):e2919. doi: 10.7759/cureus.2919. PMID: 30186724; PMCID: PMC6122668.

Krumm, A., Means, B., & Bienkowski, M. (2018). Learning analytics goes to school: A collaborative approach to improving education.

Strassburg MA. The global eradication of smallpox. Am J Infect Control. 1982 May;10(2):53-9. doi: 10.1016/0196-6553(82)90003-7. PMID: 7044193.