Reddit WallStreetBets Posts Prediction and Analysis

Nathan Murzello, Narender Reddy Konuganti, Sai Charan Pappala and Idania Viton

9/12/2022

Intro

r/wallstreetbets is a community of investors who get together to share insights and talk about stocks. It is interesting and influential because of its size: it was already a group of about 2 million people, and after the publicity from the GameStop stock boom, membership ballooned to 10.8 million.

Problem description

The fast-growing Reddit discussion board WallStreetBets is influencing meteoric rises and falls in share prices.
- There is controversy over GME stock and tension between retail investors and institutions.
- Identify the impact of various document-level variables on WallStreetBets data to better understand the context of the data.
- Compare what is talked about on Reddit with what is pushed in news headlines.
The problem is interesting because of the popularity of social media and the power of viral posts.

Analyze WallStreetBets posts.
Train many topic models at one time.
Evaluate the topic models and understand the model diagnostics.
Explore and interpret the content of the topic model(s).

Data Summary

We extracted the data using the RedditExtractoR package. The API call returned the following fields.

Data Variables:

final_document: the document for each post, combining the post title, post text, and all of the post's comments.
date: the date the post was uploaded.
score: number of upvotes
up_ratio: upvotes as a percentage of total votes.
total_awards_received: number of awards given to a post.
comments: number of comments the post had.
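
As a rough illustration of how final_document could be assembled from the fields above, here is a minimal base R sketch. The data frames, column names (post_id, title, text, comment), and sample strings are made up for the example and are not the project's actual data.

```r
# Toy inputs (hypothetical): one row per post, one row per comment.
posts <- data.frame(
  post_id = c("a1", "b2"),
  title = c("GME earnings", "AMD calls"),
  text = c("Thoughts on Q3?", "Bought at open"),
  stringsAsFactors = FALSE
)
comments <- data.frame(
  post_id = c("a1", "a1", "b2"),
  comment = c("Holding", "Sold early", "Nice entry"),
  stringsAsFactors = FALSE
)

# Collapse all comments belonging to a post into a single string.
comment_text <- vapply(
  split(comments$comment, comments$post_id),
  paste, character(1), collapse = " "
)

# Align the collapsed comments with the posts; posts without comments get "".
all_comments <- unname(comment_text[posts$post_id])
all_comments[is.na(all_comments)] <- ""

# final_document = title + text + all comments.
posts$final_document <- trimws(paste(posts$title, posts$text, all_comments))
```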

Data Extraction

The RedditExtractoR package is limited on its own, so we also used the Pushshift API to gather URLs for the top threads in each month of our time frame (2018-2020).

Parameters Used:

subreddit: limits which subreddits posts are returned from
after: only return posts created after this timestamp
before: only return posts created before this timestamp
size: number of posts to return (default is 25 so useful to specify a larger number)
sort_type: the attribute to sort posts by (score, date, etc.)
sort: whether to sort in ascending or descending order
fields: which data to return for each post, e.g. title, text
Once the URLs are gathered, we can use the RedditExtractoR package to scrape each thread's content and comments.

Once this is completed we can combine our raw data into one large CSV file.
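
The query parameters listed above can be put together roughly as follows. This is a sketch under assumptions: the endpoint URL, the one-month window, the size of 100, and the field names are illustrative rather than the project's actual settings, and the sort attribute is spelled `sort_type` as in the Pushshift documentation. The block only builds the request URL and does not call the API.

```r
# Assumed Pushshift submission-search endpoint (not stated in this document).
base_url <- "https://api.pushshift.io/reddit/search/submission/"

# Convert a date string to a UTC epoch timestamp, as used by after/before.
to_epoch <- function(date_string) {
  sprintf("%.0f", as.numeric(as.POSIXct(date_string, tz = "UTC")))
}

# Example values for one month of top-scoring posts (illustrative).
params <- list(
  subreddit = "wallstreetbets",
  after = to_epoch("2018-01-01"),
  before = to_epoch("2018-02-01"),
  size = 100,
  sort_type = "score",
  sort = "desc",
  fields = "title,selftext,full_link"
)

# Percent-encode each value and join everything into a query string.
encoded <- vapply(
  params,
  function(v) utils::URLencode(as.character(v), reserved = TRUE),
  character(1)
)
query_url <- paste0(
  base_url, "?",
  paste0(names(encoded), "=", encoded, collapse = "&")
)
```

Repeating this for each month in the time frame would give one request per month, each returning the URLs that are then passed on for scraping.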

Data Cleaning

Standard steps: removing duplicates, missing values, etc.
Removed punctuation, emojis, URLs, and Reddit formatting from the records.
Formatted the raw data to create the final document.
- myStopWords = c("one", "two", "three", "first", "second", "third", "may", "also", "can", "whether", "just", "like", "got", "will", "A", "at", "but", "For", "in", "for", "how", "How", "several", "Tue", "but", "doesnt", "from", "The", "From", "like", "just", "shit", "fuck", "can", "have", "been", "has", "than", "with", "To", "use", "who", "of", "to", "show", "and", "include", "includes", "including", "included", "uses", "using", "used", "comprises", "on", "said", "were", "by", "that", "is", "as", "was", "an", "it", "which", "its", "had", "are", "they", "he", "be", "us")
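
The cleaning steps above could look roughly like this in base R. It is a minimal sketch, not the project's actual pipeline: the function name `clean_document`, the short stop-word subset, and the decision to lowercase first and drop digits are all assumptions made for the example.

```r
# Illustrative subset of the stop-word list (assumed to be matched after lowercasing).
my_stop_words <- c("just", "like", "can", "will", "the", "to", "is", "it")

clean_document <- function(text, stop_words) {
  text <- tolower(text)
  # Remove URLs.
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)
  # Replace emojis and other non-ASCII characters with spaces.
  text <- iconv(text, from = "UTF-8", to = "ASCII", sub = " ")
  # Replace punctuation, digits, and Reddit markdown symbols with spaces.
  text <- gsub("[^a-z\\s]", " ", text, perl = TRUE)
  # Tokenize on whitespace and drop empty tokens and stop words.
  tokens <- strsplit(text, "\\s+", perl = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  paste(tokens, collapse = " ")
}

cleaned <- clean_document("**GME** to the moon!!! https://example.com", my_stop_words)
```

Lowercasing before stop-word removal means capitalized entries such as "The" or "How" in the full list would only match if the list is lowercased as well.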

Peer Comments

Are you planning to do web scraping to collect data?

Yes, we are planning to collect the data through web scraping, using either the Reddit API or the web scraping techniques we learned in this class.

What is the format of your data and how do you plan to break it down for further analysis?

We will extract the data in CSV format from the Reddit API. Because we are dealing with unstructured text data, we will perform cleaning and preprocessing steps such as removing URLs, punctuation, and emojis; lowercasing; removing stop words; lemmatization; and removing other non-meaningful characters. We will then proceed with further analysis using advanced topic modelling techniques.

What are the document-level variables you are planning to use?

We are planning to use the columns date, score, upvotes, up_ratio, total_awards, and comments as document-level variables and to analyze the impact of each one.

Relative Frequency Analysis, Word Cloud
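
For reference, relative frequency here means each term's count divided by the total number of tokens. A minimal base R sketch, using made-up cleaned documents rather than the project's data:

```r
# Toy cleaned documents (hypothetical).
docs <- c("gme calls gme moon", "amd calls")

# Split into tokens and count each term.
tokens <- unlist(strsplit(docs, "\\s+"))
counts <- table(tokens)

# Relative frequency = term count / total token count, sorted high to low.
rel_freq <- sort(
  setNames(as.numeric(counts) / sum(counts), names(counts)),
  decreasing = TRUE
)
```

The same counts are what a word cloud would typically size its terms by.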

We used the following advanced topic models:

CTM (correlated topic model)
STM (structural topic model)
KeyATM base
KeyATM covariates

KeyATM Base vs KeyATM Covariates

keywords = list(
  top_stocks = c("amd", "intel", "apple", "microsoft", "gme", "blockbuster", "amc", "gamestock", "tesla", "mu"),
  options = c("gain", "loss", "options", "profit", "short", "calls", "put", "call", "puts"),
  stock_market = c("stock", "short", "share", "price", "market", "money"),
  buy_sell = c("buy", "dip", "sell", "order", "bought", "buying", "hold", "line"),
  trading_apps = c("robinhood", "account", "app", "trade", "webull", "trading", "fidelity", "platform")
)
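
To make the base versus covariates distinction concrete, the sketch below fits both variants with the same keyword list; the only difference is that the covariates model also receives document-level variables. It assumes the keyATM package is installed, that `docs_dfm` is a quanteda document-feature matrix built from final_document, and that `doc_vars` has one row per document. The function name, the number of no-keyword topics, the seed, and the choice of score and comments as covariates are hypothetical, not the settings used in this project.

```r
# Sketch only: defines a helper; nothing is fitted until it is called.
fit_keyatm_pair <- function(docs_dfm, keywords, doc_vars,
                            no_keyword_topics = 5, seed = 250) {
  stopifnot(requireNamespace("keyATM", quietly = TRUE))

  # Convert the dfm into keyATM's internal document format.
  keyatm_docs <- keyATM::keyATM_read(texts = docs_dfm)

  # Base model: keyword-assisted topics only.
  base_fit <- keyATM::keyATM(
    docs = keyatm_docs,
    no_keyword_topics = no_keyword_topics,
    keywords = keywords,
    model = "base",
    options = list(seed = seed)
  )

  # Covariates model: same keywords, plus document-level variables.
  # doc_vars rows must line up with the documents kept in keyatm_docs.
  cov_fit <- keyATM::keyATM(
    docs = keyatm_docs,
    no_keyword_topics = no_keyword_topics,
    keywords = keywords,
    model = "covariates",
    model_settings = list(
      covariates_data = doc_vars,
      covariates_formula = ~ score + comments
    ),
    options = list(seed = seed)
  )

  list(base = base_fit, covariates = cov_fit)
}
```

Calling `keyATM::top_words()` on each fitted model would then allow the top terms per topic to be compared side by side.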

Comparison with Secondary Data: CNBC Headlines

We have news headlines from 2018, so we compared the Reddit API data with the news headlines for that year. AMD was the most-trending stock in 2018, so the comparison focuses primarily on AMD.
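
One simple way to frame such a comparison is the share of documents in each source that mention the ticker. The sketch below is an assumption about how this could be done, not the analysis actually performed; the helper `mention_share` and all sample strings are invented for illustration.

```r
# Share of texts that mention a term as a whole word (case-insensitive).
mention_share <- function(texts, term) {
  pattern <- paste0("\\b", term, "\\b")
  mean(grepl(pattern, tolower(texts), perl = TRUE))
}

# Made-up examples standing in for 2018 Reddit documents and CNBC headlines.
reddit_docs <- c(
  "amd calls printing", "gme to the moon",
  "buying amd dip", "amd earnings play"
)
cnbc_headlines <- c(
  "AMD shares surge on chip demand",
  "Fed holds rates steady"
)

shares <- c(
  reddit = mention_share(reddit_docs, "amd"),
  cnbc = mention_share(cnbc_headlines, "amd")
)
```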