Exploring the Reddit API with RedditExtractoR

Functions

RedditExtractoR is straightforward to use: with a handful of functions, you can gather plenty of data on specific topics, search terms, or users.

reddit_urls

This first function pulls the URLs of Reddit threads that match a search term and returns them in a data frame along with each thread’s attributes. For example, let’s find the links for all threads where the author “Tolkien” is mentioned. We’ll create a data frame called “links” that lists all of the relevant URLs.

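library(RedditExtractoR)

# Search Reddit for threads mentioning "Tolkien", reading up to 2 pages of results
links <- reddit_urls(
  search_terms   = "Tolkien",
  page_threshold = 2
)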

This returns a data frame (50 results in this run; the count will vary over time) with one row per thread, including the URL, the title of the thread, the posting date, the number of comments, and the name of the subreddit where it was posted.

Parameters that can be used for the reddit_urls function include:

Parameter	Usage
search_terms	A word or words to specify topics for search
page_threshold	The number of pages to search, limiting results (default is 1)
cn_threshold	Minimum number of comments a thread must have to be included in the results
subreddit	Restrict the search to a single subreddit (a topic-specific Reddit community)
regex_filter	A regular expression applied to thread titles; threads whose titles do not match are dropped
sort_by	Sort order for the results, either "comments" or "new"
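
To make the two filtering parameters more concrete, here is a minimal offline sketch in base R. It does not call Reddit at all: the data frame is invented for illustration, the helper filter_links is hypothetical, and the column names title and num_comments are assumptions about what reddit_urls returns. It also assumes that regex_filter is matched against thread titles and keeps the matches, and that cn_threshold is an inclusive minimum.

```r
# Toy stand-in for a reddit_urls() result; all values are invented for illustration.
links_demo <- data.frame(
  title        = c("Tolkien reading order", "Best fantasy maps", "Tolkien's letters"),
  num_comments = c(120, 45, 8),
  URL          = c("https://example.com/a", "https://example.com/b", "https://example.com/c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the assumed semantics:
# keep rows whose title matches the pattern AND whose comment count reaches the threshold.
filter_links <- function(df, regex_filter = "", cn_threshold = 0) {
  keep <- grepl(regex_filter, df$title) & df$num_comments >= cn_threshold
  df[keep, , drop = FALSE]
}

filtered <- filter_links(links_demo, regex_filter = "Tolkien", cn_threshold = 50)
print(filtered$title)
```

With the real package the same idea would be expressed by passing regex_filter and cn_threshold directly to reddit_urls, which avoids downloading results you would discard anyway.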

reddit_content

This function finds all of the comments of a particular thread. For this example, we’ll use the “links” data frame created above and find all of the content from the first URL, or thread, in that set. Note that this search can take some time, so it may be helpful to limit the search as applicable.

content <- reddit_content(links$URL[1])

This returns a data set with all of the attributes for each comment in the thread, including the username and date. One interesting attribute is the structure column, which encodes the position of each comment in the reply tree: the first top-level comment in a thread will be 1, the second 2, the first reply to the second comment will be 2_1, and so on.
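
Because the structure value encodes the reply chain, the nesting depth of a comment can be read straight from it. The sketch below is one way to do that in base R; comment_depth is a hypothetical helper, the example values are made up, and the separator between levels is treated as any non-digit character so the code does not depend on the exact notation.

```r
# Example structure values; with real data these would come from content$structure.
structure_demo <- c("1", "2", "2_1", "2_1_1", "3")

# Depth = number of numeric parts in the value.
# Any run of non-digit characters is accepted as the level separator.
comment_depth <- function(structure) {
  lengths(strsplit(as.character(structure), "[^0-9]+"))
}

depth <- comment_depth(structure_demo)
print(data.frame(structure = structure_demo, depth = depth))
```

On a real thread, table(comment_depth(content$structure)) would give a quick count of how many comments sit at each level of the discussion.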

get_reddit

A sort of combination of the first two functions, this one finds the threads that match the search term and then pulls every comment in those threads, with all of their attributes. For our example, we’ll create a data set called “getcontent” and search again for “Tolkien.” This time we will limit the results to threads with at least 2,500 comments. The function can return a lot of results, so narrowing it down to threads that had large discussions is useful (and quicker to run!):

getcontent <- get_reddit(
  search_terms = "Tolkien",
  page_threshold = 1,
  cn_threshold = 2500
)

This data set is large, with a separate record for each comment of each thread matching the search, and includes all of the attributes returned by the reddit_urls and reddit_content functions. Likewise, the same parameters as the reddit_urls function can be used to narrow results.
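
The “combination” idea can be pictured as a two-step pipeline: search for URLs, then fetch the content behind each one. The sketch below illustrates that flow only; it is not claimed to be how the package implements get_reddit. The helper search_then_fetch and both stand-in functions are hypothetical, and the stand-ins return invented data so the example runs without network access.

```r
# Hypothetical helper: chain a URL search and a content fetch.
# find_urls and fetch_content are passed in so the sketch runs offline;
# with the package loaded they would play the roles of reddit_urls and reddit_content.
search_then_fetch <- function(find_urls, fetch_content, ...) {
  urls <- unique(as.character(find_urls(...)$URL))
  do.call(rbind, lapply(urls, fetch_content))
}

# Stand-in functions returning invented data, for illustration only.
fake_urls <- function(search_terms) {
  data.frame(URL = c("https://example.com/t1", "https://example.com/t2"),
             stringsAsFactors = FALSE)
}
fake_content <- function(url) {
  data.frame(URL = url, structure = c("1", "1_1"), stringsAsFactors = FALSE)
}

combined <- search_then_fetch(fake_urls, fake_content, search_terms = "Tolkien")
print(combined)
```

The result has one row per comment and repeats the thread URL on each row, which is why the combined data set grows so quickly.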

construct_graph

To make use of some of this data, the construct_graph function will give a visual structure of a thread and its comment chains. For the Tolkien example, we will plug the content data set created earlier into the function to get a graph:

graph <- construct_graph(content, plot = TRUE)
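
To see what kind of information such a graph is built from, the reply tree can be written out as an edge list derived from the structure column. This is a rough base R sketch under stated assumptions, not the package’s own method: comment_edges is a hypothetical helper, the example values are made up, and the original post is labelled "0" purely as a convention for this example.

```r
# Example structure values; with real data these would come from content$structure.
structure_demo <- c("1", "2", "2_1", "2_1_1", "3")

# Build parent -> child edges. The parent of "2_1_1" is "2_1";
# top-level comments attach to the original post, labelled "0" here.
comment_edges <- function(structure) {
  structure <- as.character(structure)
  top_level <- !grepl("[^0-9]", structure)          # no separator means top level
  parent    <- sub("[^0-9]+[0-9]+$", "", structure) # drop the last level
  parent[top_level] <- "0"
  data.frame(from = parent, to = structure, stringsAsFactors = FALSE)
}

edges <- comment_edges(structure_demo)
print(edges)
```

An edge list in this shape can be handed to a graph library, for example igraph::graph_from_data_frame(edges), if you want to draw or analyse the tree yourself.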

user_network

Finally, this function will give a visualization of a single thread, showing the network web of users and comments branching out from the original thread post.

user <- user_network(content, include_author = TRUE, agg = TRUE)
user$plot

The result here is a very large web, as this example is a very complex discussion with many users and comments. However, the visualization helps to see the branches of discussion and how users have interacted with each other in a way that is much quicker to grasp than from the raw data.

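Parameters that can be used for the user_network function include:

Parameter	Usage
include_author	A TRUE or FALSE option that controls whether the author of the original thread post is included in the network
agg	A TRUE or FALSE option that, if TRUE, aggregates repeated interactions between the same pair of users into a single connection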
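
As a rough illustration of what aggregating can mean for a user network, the sketch below collapses repeated replies between the same two users into one weighted edge. The user names and replies are invented, and this is one plausible reading of aggregation rather than the package’s actual implementation.

```r
# Invented reply edges: the user in "from" replied to the user in "to".
replies <- data.frame(
  from = c("alice", "alice", "bob",   "carol", "alice"),
  to   = c("bob",   "bob",   "alice", "bob",   "carol"),
  stringsAsFactors = FALSE
)

# Give every reply a weight of 1, then sum the weights per (from, to) pair
# so that repeated pairs collapse into a single weighted edge.
replies$weight <- 1
aggregated <- aggregate(weight ~ from + to, data = replies, FUN = sum)
print(aggregated)
```

Here the two replies from alice to bob become a single edge with weight 2, which keeps a busy thread’s plot from being cluttered with duplicate connections.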

Hopefully this quick tutorial on using the RedditExtractoR package with the Reddit API is a good starting place for finding all sorts of interesting data from Reddit. With so much content and millions of users, the site offers a huge range of analyses that can be done with this data!

Mike Swofford

10/7/2019