To collect Reddit data in R, you’ll need a package. We’ll work with rreddit, an in-development package created by Mike Kearney. The code below installs it from the jacob-long/rreddit repository on GitHub:

if (!requireNamespace("remotes")) {
  install.packages("remotes")
}

## install from github
remotes::install_github("jacob-long/rreddit", upgrade = "always")

Okay, now you have rreddit installed. Let’s load it.

library(rreddit)

There are two main functions: get_reddit_posts() and get_reddit_comments(). They do what they sound like they would do.

Getting all posts in a subreddit

Let’s fetch some posts from the “whitesox” subreddit.

posts <- get_reddit_posts(subreddit = "whitesox", n = 100)
## ✔ #1: collected 100 posts

Simple enough: use the subreddit argument to specify the subreddit, and n to tell the function how many posts to fetch. By default, it grabs the most recent n posts.

Let’s look at what the data looks like.

posts

The output has lots of columns and 100 rows, so I’ll show only the first few of each. If you’re interested in the text of what has been posted, look at either title (the title of the post) or selftext (the body of a text post).
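As a minimal sketch of trimming the data down to the text columns, the example below uses a small made-up data frame in place of real results. Only title and selftext come from the description above; the rows, the author column, and the assumption that posts without body text have an empty or missing selftext are invented for illustration.

```r
# Made-up stand-in for the data frame returned by get_reddit_posts();
# the values and the `author` column are hypothetical.
posts_demo <- data.frame(
  author   = c("user_a", "user_b", "user_c"),
  title    = c("Game thread", "Lineup question", "Ticket prices"),
  selftext = c("", "Who bats leadoff tonight?", "Are prices going up?"),
  stringsAsFactors = FALSE
)

# Keep only the two text columns and show the first few rows
text_only <- head(posts_demo[, c("title", "selftext")], n = 2)
text_only

# Flag posts that actually have body text (assumes an empty string or NA
# means there is none)
has_text <- !is.na(posts_demo$selftext) & nzchar(posts_demo$selftext)
has_text
```

With real data you would swap posts_demo for the posts object collected earlier.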

To collect more data, just increase the number that you give to n. Note that it can take quite a while to collect a large number of posts.

If you want data from a specific timeframe, use before and after. For example:

march_posts <- get_reddit_posts(subreddit = "whitesox", n = 200,
                                before = "2020-04-01", after = "2020-03-01")
## ✔ #1: collected 100 posts
## ✔ #2: collected 100 posts

This restricts the results to March 2020. There were more than 200 posts that month, so the data are incomplete. For a relatively small subreddit such as this, you can safely set n to a very large number to make sure you get everything, as long as you’re patient.

march_posts <- get_reddit_posts(subreddit = "whitesox", n = 100000,
                                before = "2020-04-01", after = "2020-03-01")
## ✔ #1: collected 100 posts
## ✔ #2: collected 100 posts
## ✔ #3: collected 100 posts
## ✔ #4: collected 16 posts

You can look at the output, do some math, and see how many posts you’ve fetched. Of course with larger datasets, that will be harder to do with your eyeballs. Alternatively, you can use the nrow() function, which tells you how many rows there are in your data frame.

nrow(march_posts)
## [1] 316

Okay, 316.
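One way to turn that check into code: if fewer rows come back than you asked for, the request most likely ran out of matching posts before reaching n. The helper below is a hypothetical convenience function, not part of rreddit, and it is demonstrated on a stand-in data frame rather than real results.

```r
# Hypothetical helper (not part of rreddit): TRUE when fewer rows came back
# than were requested, which suggests nothing was cut off by the limit n.
collected_everything <- function(data, n_requested) {
  nrow(data) < n_requested
}

# Stand-in for march_posts: a data frame with 316 rows
march_demo <- data.frame(id = seq_len(316))

# Asked for up to 100000, got 316: the limit was not the constraint
collected_everything(march_demo, n_requested = 100000)  # TRUE

# Asked for 200, got exactly 200: there may be more posts left
first_200 <- march_demo[1:200, , drop = FALSE]
collected_everything(first_200, n_requested = 200)      # FALSE
```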

Searching for posts

This works just like before, except there is a new argument: query.

If I’m concerned that talk about the White Sox happens outside the “whitesox” subreddit, I can search for “white sox” across all subreddits.

all_posts <- get_reddit_posts(subreddit = "all", query = "white sox", n = 200)
## ✔ #1: collected 100 posts
## ✔ #2: collected 100 posts

Let’s take a look at the results…

all_posts

This should look familiar, except now we pay more attention to the subreddit column. As before, many queries return a very large number of results, and the results are sorted from newest to oldest. You may need to wait a long time to get what you need.
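To see where the matching posts come from, you can tabulate the subreddit column. This is a small sketch on made-up rows; with real data you would use all_posts in place of all_demo.

```r
# Made-up stand-in for search results; only the `subreddit` and `title`
# column names come from the text above, the rows are invented.
all_demo <- data.frame(
  subreddit = c("whitesox", "baseball", "whitesox",
                "chicago", "baseball", "whitesox"),
  title     = paste("Post", 1:6),
  stringsAsFactors = FALSE
)

# Count posts per subreddit, most common first
counts <- sort(table(all_demo$subreddit), decreasing = TRUE)
counts
```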

Getting comments

Comments are very similar to posts. Let’s take a look.

comments <- get_reddit_comments(subreddit = "whitesox", n = 100)
## ✔ #1: collected 100 posts

Now let’s look at the data.

comments

The main differences are that there’s less metadata and that the text is contained in body instead of selftext.
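Since the comment text lives in body, simple text searches work on that column. The sketch below looks for a keyword in a made-up set of comments; the rows are invented, and real results would have more columns.

```r
# Made-up stand-in for the data frame returned by get_reddit_comments()
comments_demo <- data.frame(
  subreddit = rep("whitesox", 3),
  body      = c("Great win last night",
                "That bullpen needs help",
                "Bullpen was solid today"),
  stringsAsFactors = FALSE
)

# Find comments that mention a keyword, ignoring capitalization
mentions_bullpen <- grepl("bullpen", comments_demo$body, ignore.case = TRUE)

comments_demo$body[mentions_bullpen]
sum(mentions_bullpen)  # 2
```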

Making a time series plot

This package includes a function, ts_plot(), that will help you get started using ggplot2 to plot the number of posts/comments over time.

Let’s start by collecting more data: a larger number of posts from the “whitesox” subreddit.

posts <- get_reddit_posts(subreddit = "whitesox", n = 10000)
## ✔ #1: collected 100 posts
## ✔ #2: collected 100 posts
## ✔ #3: collected 100 posts
## ...
## ✔ #100: collected 100 posts

Now we use ts_plot(), which returns a typical ggplot object.

ts_plot(posts, by = "days")

As with any ggplot object, I can make additional modifications:

library(ggplot2)
library(jtools)
ts_plot(posts, by = "days") +
  theme_nice() +
  ggtitle("Number of posts in the r/whitesox subreddit over time") +
  xlab(NULL) +
  ylab(NULL)
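If you want the numbers behind a plot like this, you can count posts per day yourself. The following is a rough base R sketch of that kind of daily count, not a description of how ts_plot() works internally. It assumes the timestamps are stored in a column called created_utc; that name is an assumption, so check names(posts) for the actual column in your data.

```r
# Made-up stand-in for collected posts; `created_utc` is an assumed
# column name for the time each post was created.
posts_ts_demo <- data.frame(
  created_utc = as.POSIXct(
    c("2020-03-01 09:00:00", "2020-03-01 18:30:00",
      "2020-03-02 12:15:00", "2020-03-04 08:45:00"),
    tz = "UTC"
  )
)

# Reduce each timestamp to its calendar day, then count posts per day
post_day <- as.Date(posts_ts_demo$created_utc, tz = "UTC")
daily <- as.data.frame(table(day = post_day), stringsAsFactors = FALSE)
daily$day <- as.Date(daily$day)
daily
```

Note that days with no posts (March 3 in this toy example) simply do not appear in the result, which matters if you plan to plot or model the counts.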