Welcome to the Redditverse
The “Redditverse” isn’t actually a thing (probably for the best), but Reddit is a social media platform that functions as a very, very large discussion board. You can upload text, images, videos, and more. Other users can then react with an “Upvote” (equivalent to a “Like”) or a “Downvote” (similar to a “Dislike”). Users can also comment on posts or respond to other users. These chains of replies are called “threads” and somewhat resemble a Twitter thread, but the scale of Reddit threads can be truly astounding. As one of the biggest forum-based social media platforms, Reddit offers a plethora of information on a seemingly unlimited number of topics.
Reddit is made up of millions of subreddits, each with a name of the form “r/datascience.” Almost any word you can think of will follow the “r/”, and this is why Reddit is such a wealth of information (and disinformation).
Time to Set Things Up
RedditExtractoR
Normally, we would have to install packages like httr and rvest to interact with the website and scrape the data. However, thanks to Ivan Rivera on GitHub, there is a package that streamlines the whole process: RedditExtractoR. So, we start the tutorial by loading this package.
library(RedditExtractoR)
And that’s it! Apart from actually using the functions within the package, that is the whole setup of RedditExtractoR. For most of this package, you will not need to authenticate yourself with Reddit.
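If library() complains that the package cannot be found, it needs to be installed first. Here is a minimal sketch, assuming the package is available on CRAN under the same name. Note that ensure_installed is a hypothetical helper I made up for illustration, not part of RedditExtractoR.

```r
# Hypothetical helper (not part of RedditExtractoR):
# install a package from CRAN only when it is missing.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can be loaded now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage (downloads from CRAN the first time it runs):
# ensure_installed("RedditExtractoR")
# library(RedditExtractoR)
```

The check with requireNamespace() keeps the script from reinstalling the package every time it runs.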
Extracting Some Data
Let’s say that we need to search for instances of a word (or words) being used. From a business perspective, knowing how your product or service is being talked about is valuable when performing a sentiment analysis later on. For this example, I want to know where the word “sunchips” has been used.
urls <- find_thread_urls(keywords = "sunchips")
View(urls)
The result is a table with extra metadata to give us more insight into the data itself. In this instance, not many Reddit users mention sunchips, but the threads that did were scraped. You’ll get the date, the thread title, the text where “sunchips” was explicitly mentioned, and other valuable information.
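If the search comes back thin, it may be worth widening it. As far as I can tell from the package documentation, find_thread_urls() also takes sort_by, subreddit, and period arguments, with period defaulting to the last month. The sketch below shows a wider search (commented out, since it hits Reddit) plus a small hypothetical helper, filter_mentions, for keeping only rows that literally mention the keyword. It assumes the result has title and text columns.

```r
# Wider search; run with RedditExtractoR loaded and a network connection.
# The sort_by and period values are taken from the package documentation.
# urls <- find_thread_urls(keywords = "sunchips", sort_by = "new", period = "year")

# Hypothetical helper: keep threads whose title or text mentions the keyword.
# Assumes `title` and `text` columns; the keyword is treated as a
# case-insensitive regular expression.
filter_mentions <- function(threads, keyword) {
  in_title <- grepl(keyword, threads$title, ignore.case = TRUE)
  in_text  <- grepl(keyword, threads$text, ignore.case = TRUE)
  threads[in_title | in_text, , drop = FALSE]
}

# Usage:
# mentions <- filter_mentions(urls, "sunchips")
```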
One More Function
I think the next function is easily the coolest, but also the scariest. It allows you to extract data on specific users that you provide. Given a thread, you could even loop this command to scrape information on every user who commented on one or more posts. First, name a user:
user <- "Poem_for_your_sprog"
Then, you can start extracting data on them.
sprog <- get_user_content(user)
The amount of information you receive is remarkable. You will receive a list made up of three data frames (about, comments, and active threads). These each contain information on what the account has posted. Yes, that does include old comments as well.
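To make the "everyone who commented on a post" idea concrete, here is a rough sketch. It assumes the comments table returned by get_thread_content() has an author column and that get_user_content() accepts a vector of user names, which is my reading of the package documentation. The helper unique_commenters is hypothetical.

```r
# Hypothetical helper: distinct commenter names from a comments data frame,
# skipping missing values and deleted accounts.
# Assumes an `author` column in the comments table.
unique_commenters <- function(comments) {
  authors <- unique(as.character(comments$author))
  authors[!is.na(authors) & !authors %in% c("[deleted]", "")]
}

# Usage (needs RedditExtractoR loaded and a network connection):
# thread  <- get_thread_content(urls$url[1])
# users   <- unique_commenters(thread$comments)
# content <- get_user_content(users)  # one call for the whole vector
```

On a busy thread this can mean hundreds of requests, so expect it to take a while.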
I Lied - More Fun Functions
There are a lot of functions in this package, and highlighting every one would make this document a good bit longer. CRAN, the R Project's package repository, also hosts a reference manual covering every single function, if you’re interested. Here are two more that could prove useful:
| Function | Description |
|---|---|
| get_thread_content | Collects two data frames from a specified thread: one with the thread's metadata and the other with every comment in the thread |
| find_subreddits | You can search for any subreddit using a specified keyword. |
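Here is a quick sketch of how these two might be used together. The wrapper explore_keyword is hypothetical. The element names threads and comments are how I understand the return value of get_thread_content() from the package documentation, so treat them as assumptions.

```r
# Hypothetical wrapper: fetch one thread and look up subreddits for a keyword.
# Both calls hit Reddit over the network, so nothing runs until it is called.
explore_keyword <- function(keyword, thread_url) {
  thread <- RedditExtractoR::get_thread_content(thread_url)
  subs   <- RedditExtractoR::find_subreddits(keyword)
  list(
    thread_meta = thread$threads,   # metadata about the thread (assumed name)
    comments    = thread$comments,  # one row per comment (assumed name)
    subreddits  = subs              # subreddits matching the keyword
  )
}

# Usage:
# result <- explore_keyword("sunchips", urls$url[1])
# View(result$comments)
```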
Final Note
Overall, the RedditExtractoR package is highly useful and easy to use. The amount of data that you can collect without having to manually interact with Reddit’s API is really cool, but it also serves as a reminder to always watch what you put out on the internet. This basic tool could find every comment and post your account has made, so be nice out there!