Reddit WallStreetBets Posts Analysis using Topic Modeling

Group 5: Nathan Murzello, Narender Reddy Konuganti, Sai Charan Pappala and Idania Viton

9/12/2022

Data:

Reddit is a social media network whose content is text-based and centered on user discussion.
We will use the Reddit API to collect WallStreetBets posts, which will serve as our primary data source.
We will also use secondary data from Kaggle: financial news headlines from CNBC, the Guardian, and Reuters.

The fast-growing Reddit discussion board WallStreetBets is influencing sharp rises and falls in the share prices of certain stocks.
- Examine the controversy over GME stock and the feelings of retail investors toward institutional investors.
- Identify the impact of various document-level variables on the WallStreetBets data to better understand its context.
- Compare what is discussed on Reddit with what is pushed in news headlines.
The topic is interesting because of the popularity of social media and the power of viral posts.

Quantitative analysis of textual data:
- Create corpus (static container of texts)
- Identify document-level variables and generate tokens
- Construct a document-feature matrix (“documents” in rows and “features” in columns)
- Relative frequency analysis (keyness)
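
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy posts and headlines are invented, the function names (tokenize, build_dfm, keyness) are hypothetical, and keyness is computed with a simple signed log-likelihood (G2) score, which is one common choice and not necessarily the measure the project will use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_dfm(docs):
    """Return (features, matrix): one row per document, one column per feature."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    features = sorted(set().union(*counts))
    matrix = [[c[f] for f in features] for c in counts]
    return features, matrix

def keyness(target_docs, reference_docs):
    """Signed G2 score per word: positive means over-represented in the target."""
    target = Counter(t for d in target_docs for t in tokenize(d))
    reference = Counter(t for d in reference_docs for t in tokenize(d))
    n_t, n_r = sum(target.values()), sum(reference.values())
    scores = {}
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]
        # Expected counts if the word were equally frequent in both groups.
        e_t = n_t * (a + b) / (n_t + n_r)
        e_r = n_r * (a + b) / (n_t + n_r)
        g2 = 2 * sum(o * math.log(o / e) for o, e in ((a, e_t), (b, e_r)) if o > 0)
        scores[word] = g2 if a / n_t >= b / n_r else -g2
    return scores

# Invented toy corpora (assumption: not real project data).
reddit_posts = ["GME to the moon", "holding GME, diamond hands"]
news_headlines = ["GameStop shares rise", "shares fall as markets react"]

features, matrix = build_dfm(reddit_posts)
scores = keyness(reddit_posts, news_headlines)
```

Words with large positive scores are characteristic of the Reddit posts, and words with large negative scores are characteristic of the headlines. Both groups must contain at least one token, since the sketch does not guard against empty corpora.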
Advanced topic modeling:
- CTM (Correlated Topic Model)
- STM (Structural Topic Model)
We will also try to implement other advanced topic models, such as keyATM with covariates and Seeded LDA.

Evaluate the models using metrics such as the residuals, the semantic coherence of the topics, and the likelihood on held-out data.
Create diagnostic plots of these quantities to understand how the models perform at various numbers of topics.
Evaluate the correlations found between topics from the different sources and compare them with real-world knowledge.
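
As an illustration of one of the evaluation metrics above, the sketch below computes semantic coherence for a single topic from document co-occurrence counts. It assumes the commonly used definition that sums log((D(v_i, v_j) + 1) / D(v_j)) over pairs of a topic's top words, where D counts the documents containing the given words. The function name and the toy documents are hypothetical, and the project's actual tooling may compute this differently.

```python
import math

def semantic_coherence(top_words, documents):
    """Coherence of one topic given its top words, most probable first.

    documents is a list of token lists. Scores closer to zero mean the
    top words tend to appear together in the same documents.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                continue  # skip words that never occur, to avoid dividing by zero
            co = doc_freq(top_words[i], top_words[j])
            score += math.log((co + 1) / d_j)
    return score

# Invented toy documents (assumption: already tokenized).
toy_docs = [
    ["gme", "moon", "hold"],
    ["gme", "hold"],
    ["shares", "fall"],
    ["gme", "shares"],
]

coherent = semantic_coherence(["gme", "hold"], toy_docs)
incoherent = semantic_coherence(["gme", "fall"], toy_docs)
```

Here the first topic scores higher than the second because "gme" and "hold" co-occur in two documents, while "gme" and "fall" never appear together. Computing this score for every topic at each candidate number of topics gives one of the quantities that could go into the diagnostic plots.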