1. Prepare

1a. Reviewed Literature

There has been some initial research into how sentiment might affect learning in online forums (Kagklis et al., 2015), as well as how text mining can reveal some of the discursive processes within Reddit communities (Mueller, 2016; White, 2019). I want to build on these lines of inquiry by examining how one particularly structured Reddit community, r/ChangeMyView, might be successful or unsuccessful in its objectives, based on the word choice and sentiment of the discussion that happens there.

1b. Defined Questions

I defined two main questions for my research:

  1. What are the most frequent words or phrases used in r/ChangeMyView’s most visibly productive and unproductive threads?

  2. How does sentiment within the productive threads compare to the sentiment within the unproductive threads?

For RQ1, I deliberately defined my samples from r/ChangeMyView as “most visibly productive,” since this is exploratory research and there may be different, or better, ways of selecting and defining these terms. As it stands, “most visibly productive” refers to my use of Reddit’s integrated “top” and “controversial” sorting functions to identify the threads that generated the most discussion across all r/CMV posts.

1c. Installed Packages

My project used a few packages, most of which were featured in our class:

  • The tidyverse suite

  • tidytext, for tidying and tokenizing text

  • vader, for sentiment analysis

  • here, for project-relative file paths

  • textdata, which supplies sentiment lexicons

  • wordcloud2, for the word cloud visualizations

  • stringr, whose str_detect() provided some more nuance to filter()

Additionally, I needed a way to scrape data from Reddit. Much to my pleasant surprise, I found a package called RedditExtractoR that interfaces with Reddit’s API to scrape data and metadata from subreddits, individual threads, or even users!

install.packages(c("tidyverse", "tidytext", "vader", "here", "textdata", "wordcloud2", "stringr", "RedditExtractoR"))
library(tidyverse)
library(tidytext)
library(vader)
library(here)
library(textdata)
library(wordcloud2)
library(stringr)
library(RedditExtractoR)

1d. Selected r/ChangeMyView Samples

I chose to identify my threads using the sorting functions built into Reddit’s interface. I chose my “most productive” threads by selecting “top posts of all time,” which generally means that these posts received a very high ratio of upvotes to downvotes. In short, people wanted these 10 posts to reach the most visible spot in r/ChangeMyView. In a perfect world, the community upvoted these because they had the most productive and interesting discourse, although it could simply be that people agreed with the OP’s viewpoint. Still, sorting by top was a good way to find lots of commentary and activity.

The top 10 threads all included Deltas awarded by OP as well, which I found promising. As a reminder from my proposal two weeks ago, a “Delta” can be given by any member of r/CMV to another member whose comment successfully and concretely changed the giver’s view in some way regarding the original topic.

Identifying the “least productive” threads of all time proved a little more difficult. The community does not seem to curate a list of unproductive threads, nor does Reddit offer a default “bottom posts” sorting method. There is, however, a way of finding the most volatile threads. Sorting by “controversial” surfaces posts with a much more even ratio of upvotes to downvotes, indicating that the community had trouble agreeing on whether a post was an appropriate or representative discussion for r/ChangeMyView.

I still wanted to account for the presence of Deltas here, so I hand-selected the top 10 posts that did not include a Delta from OP. These threads could still have Deltas within the comments, however. Roughly every other post in the 20 most controversial fit this criterion.

Thankfully, to prepare these threads for scraping, I needed only to collect their URLs and have them ready in two lists: one for productive threads and one for unproductive threads.
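As an aside, RedditExtractoR also ships a find_thread_urls() helper that lists a subreddit’s threads, which could make the “top of all time” half of this selection reproducible in code. Here is a minimal sketch under the assumption that the function accepts subreddit, sort_by, and period arguments and returns a url column (these names should be checked against the package’s current documentation); the “controversial” set would still need hand selection to exclude threads with an OP Delta.

library(RedditExtractoR)

# Assumption: find_thread_urls() lists subreddit threads and accepts
# sort_by/period arguments (verify against the RedditExtractoR docs).
top_cmv <- find_thread_urls(subreddit = "changemyview",
                            sort_by = "top",
                            period = "all")

# Keep the ten most visible threads as the "most visibly productive" pool.
productive_urls <- head(top_cmv$url, 10)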

2. Wrangle

Imported Data from Reddit API

For this project I wanted to scrape all of the body text from these 20 threads. RedditExtractoR offers a few functions, but the only one I needed this time was get_thread_content(). I scraped the 20 threads and saved the results as top_productive and top_unproductive, respectively:

top_productive <- get_thread_content(c(
 "https://www.reddit.com/r/changemyview/comments/fdziov/cmv_mike_bloombergs_campaign_is_proof_that_the/",
  "https://www.reddit.com/r/changemyview/comments/hlpd7d/cmv_kanye_west_is_a_shill_for_president_trump_and/", 
  "https://www.reddit.com/r/changemyview/comments/mzr23d/cmv_most_americans_who_oppose_a_national/", 
  "https://www.reddit.com/r/changemyview/comments/iq41dt/cmv_donald_trump_has_not_made_a_single_lasting/", 
  "https://www.reddit.com/r/changemyview/comments/kvwbxj/cmv_being_a_conservative_is_the_least_christlike/", 
  "https://www.reddit.com/r/changemyview/comments/p9c6x2/cmv_voluntarily_unvaccinated_people_should_be/", 
  "https://www.reddit.com/r/changemyview/comments/jfz65t/cmv_the_work_hard_and_dont_give_up_message_common/", 
  "https://www.reddit.com/r/changemyview/comments/hs9xnd/cmv_politicians_should_be_required_to_wear/", 
  "https://www.reddit.com/r/changemyview/comments/kyjzxi/cmv_democrats_and_republicans_live_in_completely/",
  "https://www.reddit.com/r/changemyview/comments/mglg30/cmv_folks_is_a_reasonably_inclusive_gender/"
))
top_unproductive <- get_thread_content(c(
  "https://www.reddit.com/r/changemyview/comments/rafqmw/cmv_cancel_culture_doesnt_actually_exist_the/",
  "https://www.reddit.com/r/changemyview/comments/2slxhh/cmv_racism_against_white_people_in_america_doesnt/",
  "https://www.reddit.com/r/changemyview/comments/2scqee/cmv_pepsi_is_the_inferior_soda/",
  "https://www.reddit.com/r/changemyview/comments/n69wmu/cmv_the_republicans_are_a_threat_to_democracy/",
  "https://www.reddit.com/r/changemyview/comments/2ks3pk/cmv_gaming_community_of_reddit_is_full_of/",
  "https://www.reddit.com/r/changemyview/comments/2vv56q/cmv_britney_spears_is_just_as_good_as_johnny_cash/",
  "https://www.reddit.com/r/changemyview/comments/1v37km/i_think_that_banning_downvote_brigades_from/",
  "https://www.reddit.com/r/changemyview/comments/ppp1gq/cmv_rich_people_are_inherently_kind_of_villainous/",
  "https://www.reddit.com/r/changemyview/comments/26lvwv/cmv_the_pick_up_artist_community_and_specifically/",
  "https://www.reddit.com/r/changemyview/comments/q5yxcw/cmv_when_people_say_theyre_only_against_illegal/"
))

This function is indiscriminate about what data it pulls, saving everything as a list that contains two dataframes: one for thread metadata and one for comments. Because I was only interested in the body text, I first wanted to extract the comments dataframe from each list and save them as comments_productive and comments_unproductive:

comments_productive <- top_productive[["comments"]]
comments_unproductive <- top_unproductive[["comments"]]

Restructured the Data

I wanted to restructure the data to remove irrelevant text generated by the moderation process. For my productive set, I used filter() to remove comments from DeltaBot, which automatically notes when OP awards a Delta. I didn’t have to do this for my unproductive threads, since I had already pre-selected for that quality. I then used select() to strip both sets down to the comment text.

comments_productive_2 <- comments_productive %>%
  filter(author != "DeltaBot") %>%
  select(comment)
comments_unproductive_2 <- comments_unproductive %>%
  select(comment)

I also wanted to remove any entries that had been deleted by the user or pruned by moderators before the data were retrieved. I did this with additional filter() calls: first for entries simply notated in brackets as deleted or removed, then using str_detect() to drop strings containing the phrase “your comment has been removed”, which appeared to be entered manually by mods for specific pruning.

tidy_productive <- comments_productive_2 %>%
  filter(comment != "[deleted]") %>%
  filter(comment != "[removed]") %>%
  filter(!str_detect(comment, "your comment has been removed"))
tidy_unproductive <- comments_unproductive_2 %>%
   filter(comment != "[deleted]") %>%
   filter(comment != "[removed]") %>%
   filter(!str_detect(comment, "your comment has been removed"))

Tidied Text

Tokenized the Data

I tokenized both dataframes into unigram tokens with tidytext::unnest_tokens():

unigram_productive <- tidy_productive %>%
  unnest_tokens(output = word,
                input = comment)
unigram_unproductive <- tidy_unproductive %>%
  unnest_tokens(output = word,
                input = comment)

Removed Stop Words

It was time to remove less useful words, starting with the default stop_words lexicon. I removed them from both sets using anti_join():

unigram_productive_2 <- anti_join(unigram_productive,
                         stop_words,
                         by = "word")
unigram_unproductive_2 <- anti_join(unigram_unproductive,
                            stop_words,
                            by = "word")

Next, I used count() to look at the remaining top tokens to see if there were any stragglers.

unigram_productive_2 %>%
  count(word, sort = TRUE)
unigram_unproductive_2 %>%
  count(word, sort = TRUE)

The words that stood out were forum and HTML artifacts, such as “https” and “gt”. I made a my_stopwords vector and removed those terms from both sets using filter(). I also decided not to filter by possibly_sensitive for this, mostly just to see what happens!

my_stopwords <- c("https", "http", "gt", "amp", "1", "2", "3", "10", "don", "www.reddit.com", "x200b", "i.imgur.com")

unigram_productive_3 <-
  unigram_productive_2 %>%
  filter(!word %in% my_stopwords)
unigram_unproductive_3 <-
  unigram_unproductive_2 %>%
  filter(!word %in% my_stopwords)

Repeated for Bigrams

I wanted to see what bigram tokenization would look like as well, so I used the token = "ngrams" argument of unnest_tokens(), then removed stop words with separate(), filter(), and unite().

bigram_productive <- tidy_productive %>% 
  unnest_tokens(output = bigram, 
                input = comment, 
                token = "ngrams", 
                n = 2)
bigram_separated_productive <- bigram_productive %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigram_filtered_productive <- bigram_separated_productive %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_tidied_productive <- bigram_filtered_productive %>%
  unite(bigram, word1, word2, sep = " ")
bigram_unproductive <- tidy_unproductive %>% 
  unnest_tokens(output = bigram, 
                input = comment, 
                token = "ngrams", 
                n = 2)
bigram_separated_unproductive <- bigram_unproductive %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigram_filtered_unproductive <- bigram_separated_unproductive %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_tidied_unproductive <- bigram_filtered_unproductive %>%
  unite(bigram, word1, word2, sep = " ")

Here’s what they looked like:

bigram_tidied_productive %>% 
  count(bigram, sort = TRUE)
bigram_tidied_unproductive %>% 
  count(bigram, sort = TRUE)

I then gave them one more round with my custom stop words:

bigram_tidied_productive <- bigram_separated_productive %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word1 %in% my_stopwords) %>%
  filter(!word2 %in% my_stopwords) %>%
  unite(bigram, word1, word2, sep = " ")
bigram_tidied_productive %>% 
  count(bigram, sort = TRUE)
bigram_tidied_unproductive <- bigram_separated_unproductive %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word1 %in% my_stopwords) %>%
  filter(!word2 %in% my_stopwords) %>%
  unite(bigram, word1, word2, sep = " ")
bigram_tidied_unproductive %>% 
  count(bigram, sort = TRUE)

3. Explore

Top Tokens

Now it was time to see the top tokens for each set:

productive_top_tokens <- unigram_productive_3 %>%
  count(word, sort = TRUE) %>%
  top_n(50)
unproductive_top_tokens <- unigram_unproductive_3 %>%
  count(word, sort = TRUE) %>%
  top_n(50)

Word Cloud

It was time to stare at some clouds. Using wordcloud2(), I made clouds for each set:

wordcloud2(productive_top_tokens,fontWeight = 'normal',
    color = 'random-light', backgroundColor = "black")
wordcloud2(unproductive_top_tokens,fontWeight = 'normal',
    color = 'random-light', backgroundColor = "black")

It looked like “people” was a pretty common denominator (I might have anticipated that had I looked more carefully at the top token frequencies before modeling). I filtered it out and ran the clouds again:

productive_top_tokens_2 <- productive_top_tokens %>%
  filter(word != "people")
wordcloud2(productive_top_tokens_2,fontWeight = 'normal',
    color = 'random-light', backgroundColor = "black")
unproductive_top_tokens_2 <- unproductive_top_tokens %>%
  filter(word != "people")
wordcloud2(unproductive_top_tokens_2,fontWeight = 'normal',
    color = 'random-light', backgroundColor = "black")

Some differences started to show through! For those curious, the token “srs” in the unproductive cloud refers to the subreddit r/ShitRedditSays. This has been a rather polarizing “meta-subreddit” that gathers and calls out posts and activity on Reddit that other users consider problematic. As you can imagine, it comes up a lot in Reddit discourse on free speech, “cancel culture,” etc.

4. Model

Sentiment Analysis

How would these sets hold up to some sentiment analysis? vader was able to offer some insight. I took my tidy comment sets and ran them through vader_df() to see what it could find.

Sampled Productive Comment Set

Unfortunately, my fear that the tidy_productive set was too big was realized: vader crashed my entire project while trying to process it! For the purposes of this assignment, I took a sample to make the set more manageable, matching the number of observations in tidy_unproductive, which was several times smaller. Since comment length varied widely, I hoped that, with a large n, the sample would also average roughly the same word count per observation as tidy_unproductive.

tidy_productive_sample <- sample_n(tidy_productive, 728)
# vader_df() expects a text vector, so pass the comment column explicitly
vader_productive <- vader_df(tidy_productive_sample$comment)
vader_unproductive <- vader_df(tidy_unproductive$comment)

Summarized Analysis

I then summarized vader’s findings for comparison:

vader_productive_summary <- vader_productive %>% 
  mutate(sentiment = ifelse(compound >= 0.05, "positive",
                            ifelse(compound <= -0.05, "negative", "neutral"))) %>%
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>% 
  relocate(positive) %>%
  mutate(ratio = negative/positive)
vader_productive_summary

I noticed that vader_productive_summary included two <NA> results. I don’t yet know what they mean, or how to find those two comments (or others like them within the larger sample), but this has implications for further data tidying.
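One hedged way to track those rows down: vader_df() returns its scores alongside the input text, so filtering for a missing compound score should surface the unscored comments (this assumes the input text comes back in a column named text):

# Pull out the comments vader could not score, for later inspection
vader_na_rows <- vader_productive %>%
  filter(is.na(compound))

vader_na_rows$text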

vader_unproductive_summary <- vader_unproductive %>% 
  mutate(sentiment = ifelse(compound >= 0.05, "positive",
                            ifelse(compound <= -0.05, "negative", "neutral"))) %>%
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>% 
  relocate(positive) %>%
  mutate(ratio = negative/positive)
vader_unproductive_summary

5. Communicate

Conclusions

To summarize my research, I revisited my two research questions:

  1. What are the most frequent words or phrases used in r/ChangeMyView’s most visibly productive and unproductive threads?

    Note: Noticing how “people” dominated both word clouds, I removed it from the unigram lists:

    Productive Unigrams     Unproductive Unigrams
    trump                   white
    money                   racism
    healthcare              person
    government              black
    tax                     republicans
    hard                    bad
    time                    women
    system                  immigration
    care                    view
    pay                     racist

    Productive Bigrams      Unproductive Bigrams
    virtue signaling        white people
    health care             cancel culture
    healthcare system       black people
    national healthcare     illegal immigration
    middle class            white person
    minimum wage            legal immigration
    health insurance        mentally ill
    fox news                black person
    donald trump            death threats
    social media            red pill

    There are several “hot button” words and phrases across the unigrams and bigrams of both the productive and unproductive threads. On their own, I’m not sure how much can be gleaned, but there might be indications that less productive conversation is more concerned with race, nationality, and ethnicity, while more productive conversation centers on economic issues and other public concerns such as healthcare.

  2. How does sentiment within the productive threads compare to the sentiment within the unproductive threads?

    It appears that the unproductive set has much more negative sentiment overall: the negative-to-positive ratio rises from 0.66 in the productive sample to 1.24 in the unproductive set. While this is consistent with many of the recommendations within r/ChangeMyView, more research is needed into exactly which discursive processes lead to this kind of conversational collapse. (A quick extension of this comparison is sketched below.)

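Beyond the positive/negative/neutral counts, the raw compound scores from the two vader runs could also be compared directly. A minimal sketch, assuming the vader_productive and vader_unproductive data frames from the Model section:

# Mean compound sentiment per set; lower values indicate a stronger
# negative lean overall
mean(vader_productive$compound, na.rm = TRUE)
mean(vader_unproductive$compound, na.rm = TRUE)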
Additional Areas of Research

Looking ahead, the metadata available through RedditExtractoR has a lot of potential. A great study that would build on this exploratory analysis could involve social network analysis, for example: by modeling how OPs branch off to specific users and comment trees, we could better see whether specific users dominate productive or unproductive behavior (similar to transmitters, transponders, and transcenders).

Another “metadatum” that wasn’t available for use this time was the Deltas themselves. I only selected posts based on whether or not OP awarded at least one, but collectively the Deltas could offer a very interesting breadcrumb trail of productive discourse. I suspect I could find patterns among people who changed their minds deep within the comments, even within threads that I selected as the most visibly unproductive.

I also didn’t get a chance to analyze other kinds of sentiment this time, but I am still very interested in using additional lexicons to identify other qualities, such as “trust/distrust,” in these conversations.

Limitations

A major limitation, largely due to my understanding of RedditExtractoR’s capabilities, is that the original posts themselves were absent from the text mining. A future iteration of this study would also import and tokenize this data, both separately and alongside the comment data, to see whether how a particular viewpoint is phrased lends itself to certain kinds of productive or unproductive discourse.
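If the metadata dataframe returned by get_thread_content() includes the post title and body text (my assumption is that the threads element carries title and text columns; this should be confirmed against RedditExtractoR’s documentation), folding the original posts into the same pipeline might look roughly like this:

# Sketch: tokenize the original posts alongside the comments.
# Assumes top_productive$threads has "title" and "text" columns.
posts_productive <- top_productive[["threads"]] %>%
  select(title, text) %>%
  unite(post, title, text, sep = " ") %>%
  unnest_tokens(output = word, input = post) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% my_stopwords)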

Legal/Ethical Considerations

All information was freely and publicly available through Reddit’s API, per Reddit’s EULA. Since Reddit users post under pseudonymous usernames rather than real names, it is reasonable to assume that this data is appropriately anonymous.

References

Kagklis, V., Karatrantou, A., Tantoula, M., Panagiotakopoulos, C. T., & Verykios, V. S. (2015). A Learning Analytics Methodology for Detecting Sentiment in Student Fora: A Case Study in Distance Education. European Journal of Open, Distance and E-Learning, 18(2), 74–94.

Mueller, C. (2016). Positive Feedback Loops: Sarcasm and the Pseudo-Argument in Reddit Communities. Working Papers in TESOL & Applied Linguistics, 16(2), 84–97.

White, A. M. (2019). Reddit as an Analogy for Scholarly Publishing and the Constructed, Contextual Nature of Authority. Communications in Information Literacy, 13(2), 147–163.