Introduction

Reddit.com is a social media news and discussion website, where votes promote user-provided stories, links, and comments to the front page of the site. As of October 2019, Reddit.com is the 18th most visited site in the world according to Alexa Internet.

As a site that relies on its users for content, interaction, and moderation, Reddit offers data that can show real-time trends in popular topics. Users “upvote” and “downvote” each other’s comments, and upvoted comments rise higher in their threads. A popular comment is therefore much more likely to be seen by other users, who may then upvote it further or reply to it. The result is a complex network of users and discussions that also provides insight into trending topics.

Reddit is made up of “subreddits” which are categories of threads of a similar topic or type. There are subreddits for politics, celebrities, memes, video games, and just about every other topic one can think of (and some one might never expect!).

RedditExtractoR

An easy way to collect data from Reddit is to use the R package RedditExtractoR. The readme for this package describes it as "An R wrapper for Reddit API. This package can be used to extract data from Reddit and construct structured datasets."

This basic API interaction does not require registering with Reddit or supplying an API token in R.

To begin, first install and load the RedditExtractoR package:

install.packages("RedditExtractoR")
library(RedditExtractoR)

Functions

RedditExtractoR is straightforward to use: with a handful of functions, plenty of data can be gathered on specified topics, on search terms, or from specific users.

reddit_urls

This first function pulls the URLs of Reddit threads that match a given search term, returning a data set with the attributes of each thread. As an example, let’s find the threads where the author “Tolkien” is mentioned. We’ll create a data frame called “links” that lists all of the relevant URLs.

links <- reddit_urls(
  search_terms   = "Tolkien",
  page_threshold = 2
)

This returns a data frame with 50 results including the URL, the title of the thread, posting date, the number of comments, and the name of the subreddit where it was posted.
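
Once the data frame is back, a few base R calls are enough to get a feel for it. The sketch below uses a small hand-made stand-in for “links” so that it runs without calling Reddit. The rows are invented, and the column names (title, subreddit, num_comments, URL) are assumptions based on the attributes listed above, so check names(links) against your own result.

```r
# Hypothetical stand-in for the data frame returned by reddit_urls();
# the rows and column names here are invented for illustration.
links_demo <- data.frame(
  title        = c("Thread A", "Thread B", "Thread C", "Thread D"),
  subreddit    = c("books", "lotr", "books", "tolkienfans"),
  num_comments = c(120, 45, 300, 8),
  URL          = paste0("https://example.com/thread/", 1:4),
  stringsAsFactors = FALSE
)

# Threads ordered from most to least discussed
by_comments <- links_demo[order(-links_demo$num_comments), ]

# Number of matching threads per subreddit, largest first
threads_per_subreddit <- sort(table(links_demo$subreddit), decreasing = TRUE)

print(by_comments[, c("title", "subreddit", "num_comments")])
print(threads_per_subreddit)
```

Swapping links_demo for the real “links” data frame should give the same kind of overview, provided the column names match.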

Parameters that can be used for the reddit_urls function include:

Parameter        Usage
search_terms     A word or words to specify topics for the search
page_threshold   The number of pages to search, limiting results (default is 1)
cn_threshold     A minimum number of comments; threads with fewer comments are dropped
subreddit        The subreddit (Reddit category) to pull results from
regex_filter     A regular expression matched against thread titles; non-matching threads are dropped
sort_by          Results can be sorted by comments or new
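
To make the two filtering parameters concrete, the sketch below mimics them locally on a hand-made data frame. It assumes that cn_threshold keeps threads with at least that many comments and that regex_filter is matched against thread titles. The helper filter_threads is hypothetical, written only for this illustration; it is not part of RedditExtractoR, and the package's own matching rules may differ in detail (see ?reddit_urls).

```r
# Hypothetical helper that imitates two reddit_urls() filters on a local
# data frame. It is NOT the package's code, only a sketch of the idea.
filter_threads <- function(threads, cn_threshold = 0, regex_filter = "") {
  enough_comments <- threads$num_comments >= cn_threshold
  # An empty pattern means "no title filter": keep every row
  title_matches <- if (nzchar(regex_filter)) {
    grepl(regex_filter, threads$title, ignore.case = TRUE)
  } else {
    rep(TRUE, nrow(threads))
  }
  threads[enough_comments & title_matches, , drop = FALSE]
}

# Invented thread titles and comment counts
threads_demo <- data.frame(
  title        = c("Tolkien's letters", "Best fantasy maps",
                   "Reading The Silmarillion", "Tolkien and language"),
  num_comments = c(12, 150, 40, 75),
  stringsAsFactors = FALSE
)

# Keep threads with at least 30 comments whose title mentions Tolkien
busy_tolkien <- filter_threads(threads_demo, cn_threshold = 30,
                               regex_filter = "tolkien")
print(busy_tolkien)
```

Filtering at search time with the real parameters is usually preferable to filtering afterwards, since it also cuts down how much has to be downloaded.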

reddit_content

This function finds all of the comments of a particular thread. For this example, we’ll use the “links” data frame created above and find all of the content from the first URL, or thread, in that set. Note that this search can take some time, so it may be helpful to limit the search as applicable.

content <- reddit_content(links$URL[1])

This returns a data set with all of the attributes for each comment in the thread, including the username and date. One interesting attribute is the structure column, which encodes the position of each comment in the thread: the first top-level comment is 1, the second is 2, the first reply to the second comment is 2_1, and so on.
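
The structure column can be turned into a nesting depth with a little string handling. The following is a minimal sketch on invented values, assuming the levels are joined with an underscore (for example 2_1); comment_depth is a hypothetical helper, and a different separator can be passed through sep if your data uses one.

```r
# Hypothetical helper: nesting depth of each comment from its structure ID.
# `sep` is the character assumed to join the levels ("_" here).
comment_depth <- function(structure, sep = "_") {
  lengths(strsplit(structure, sep, fixed = TRUE))
}

# Invented structure IDs for illustration
structure_demo <- c("1", "1_1", "1_1_1", "2", "2_1", "2_2")

depths <- comment_depth(structure_demo)
print(depths)

# How many comments sit at each level of the thread
print(table(depths))
```

With real data, comment_depth(content$structure) would give one depth per comment, which is handy for spotting how deep the reply chains in a thread run.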


get_reddit

A sort of combination of the first two functions, this one finds all threads matching the specified search term and returns each thread along with all of its comments and their attributes. In our example, we’ll create a data set called “getcontent” and search again for “Tolkien.” This time we will limit the results to threads with at least 2500 comments. The function can return a lot of results, so narrowing it down to threads with large discussions is useful (and quicker to run!):

getcontent <- get_reddit(
  search_terms = "Tolkien",
  page_threshold = 1,
  cn_threshold = 2500
)

This data set is large, with a separate record for each comment of each thread in the search results, and includes all of the attributes returned by the reddit_urls and reddit_content functions. Likewise, the same parameters as the reddit_urls function can be used to narrow results.
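
With one row per comment, per-thread and per-user summaries are a natural first step. The sketch below works on an invented stand-in for “getcontent”; the column names (title, user, comment_score) are assumptions, so adjust them to whatever names(getcontent) reports.

```r
# Invented stand-in for get_reddit() output: one row per comment.
# Column names (title, user, comment_score) are assumptions.
comments_demo <- data.frame(
  title         = c("Thread A", "Thread A", "Thread A", "Thread B", "Thread B"),
  user          = c("ann", "bob", "ann", "cy", "bob"),
  comment_score = c(10, 3, 5, 8, 2),
  stringsAsFactors = FALSE
)

# Number of comments collected for each thread
comments_per_thread <- table(comments_demo$title)

# Average comment score within each thread
mean_score_per_thread <- tapply(comments_demo$comment_score,
                                comments_demo$title, mean)

# Users ranked by how many comments they wrote across all threads
comments_per_user <- sort(table(comments_demo$user), decreasing = TRUE)

print(comments_per_thread)
print(mean_score_per_thread)
print(comments_per_user)
```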


construct_graph

To make use of some of this data, the construct_graph function will give a visual structure of a thread and its comment chains. For the Tolkien example, we will plug the content data set created earlier into the function to get a graph:

graph <- construct_graph(content, plot = TRUE)
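
To see what kind of structure such a graph is built from, the sketch below derives a parent-child edge list from structure IDs by hand. It is a rough illustration of the idea rather than the package's own code: parent_of is a hypothetical helper, the IDs are invented, and the underscore separator is an assumption.

```r
# Hypothetical helper: parent ID of each comment, derived from its structure.
# Top-level comments get the parent "post" (the thread itself).
parent_of <- function(structure, sep = "_") {
  parts <- strsplit(structure, sep, fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) == 1) "post" else paste(p[-length(p)], collapse = sep)
  }, character(1))
}

# Invented structure IDs for illustration
structure_demo <- c("1", "1_1", "1_1_1", "2", "2_1")

# One row per comment: which node it hangs from, and its own ID
edges <- data.frame(
  from = parent_of(structure_demo),
  to   = structure_demo,
  stringsAsFactors = FALSE
)
print(edges)
```

An edge list like this is the usual input for graph libraries, which is why the comment tree can be drawn directly from the structure column.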

user_network

Finally, this function will give a visualization of a single thread, showing the network web of users and comments branching out from the original thread post.

user <- user_network(content, include_author = TRUE, agg = TRUE)
user$plot

The result here is a very large web, as this example is a very complex discussion with many users and comments. However, the visualization helps to see the branches of discussion and how users have interacted with each other in a way that is much quicker to grasp than from the raw data.

Parameter        Usage
include_author   TRUE or FALSE; if TRUE, the author of the original thread post is included in the network
agg              TRUE or FALSE; if TRUE, repeated interactions between the same pair of users are aggregated
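
The effect of these two options can be imitated on a toy thread. The sketch below is not the package's implementation; it assumes that a reply links the replying user to the user who wrote the parent comment, that top-level comments link to the thread author, and that structure IDs use underscores. All data and the name thread_author are invented for illustration.

```r
# Invented comments: `structure` locates each comment, `user` wrote it.
thread_demo <- data.frame(
  structure = c("1", "1_1", "1_2", "2", "2_1", "2_1_1"),
  user      = c("ann", "bob", "bob", "cy", "ann", "cy"),
  stringsAsFactors = FALSE
)
thread_author <- "dee"  # hypothetical original poster

# Look up who wrote the parent of each comment
parent_id    <- sub("_[^_]+$", "", thread_demo$structure)
is_top_level <- !grepl("_", thread_demo$structure, fixed = TRUE)
replied_to   <- thread_demo$user[match(parent_id, thread_demo$structure)]
replied_to[is_top_level] <- thread_author  # top-level comments answer the post

edges <- data.frame(from = thread_demo$user, to = replied_to,
                    stringsAsFactors = FALSE)

# Roughly the idea behind include_author = FALSE: drop replies to the post
edges_no_author <- edges[edges$to != thread_author, ]

# Roughly the idea behind agg = TRUE: one weighted row per from-to pair
edges_agg <- aggregate(weight ~ from + to,
                       data = cbind(edges, weight = 1), FUN = sum)
print(edges_agg)
```

Aggregating keeps the plot readable for busy threads, because two users who trade many replies are joined by a single weighted link rather than a bundle of separate ones.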

Hopefully this quick tutorial on using the RedditExtractoR package with the Reddit API is a good starting place for finding all sorts of interesting data on Reddit. As a site with a lot of content and millions of users, Reddit offers nearly endless possibilities for analysis!