Reddit WallStreetBets Posts Prediction and Analysis

Nathan Murzello, Narender Reddy Konuganti, Sai Charan Pappala and Idania Viton

9/12/2022

Intro

r/wallstreetbets is a community of investors who share insights and discuss stocks. It is interesting and influential partly because of its size: the group had about 2 million members, and after the publicity from the GameStop stock boom, membership ballooned to 10.8 million.

Problem description

  1. The fast-growing Reddit discussion board WallStreetBets is influencing sharp rises and falls in stock prices.
    • Examine the controversy over GME stock and the sentiment of retail investors toward institutions.
    • Identify the impact of various document-level variables on the WallStreetBets data to better understand its context.
    • Compare what is discussed on Reddit with what is pushed in news headlines.
  2. The topic is interesting because of the popularity of social media and the power of viral posts.

Data Summary

We extracted the data using the RedditExtractoR package. The API call returned the following fields.

Data Variables:

Data Extraction

Because the RedditExtractoR package is limited, we also used the Pushshift API to gather URLs for the top threads in each month of our time frame (2018-2020).
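Querying month by month requires a start and end boundary for every month in the time frame. The sketch below only computes those boundaries as UTC epoch seconds; the function name `monthly_windows` is hypothetical, and the actual request parameters are not reproduced here, so the API call itself is left out.

```python
from datetime import datetime, timezone


def monthly_windows(start_year, end_year):
    """Return (label, start_epoch, end_epoch) per month; the end is exclusive."""
    windows = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            start = datetime(year, month, 1, tzinfo=timezone.utc)
            # The window ends at the first instant of the following month.
            if month == 12:
                end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
            else:
                end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
            label = f"{year}-{month:02d}"
            windows.append((label, int(start.timestamp()), int(end.timestamp())))
    return windows


if __name__ == "__main__":
    for label, start, end in monthly_windows(2018, 2020)[:3]:
        print(label, start, end)
```

Each tuple can then be passed to whatever query function gathers the top thread URLs for that month.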

Parameters Used:

Once this is completed, we can combine our raw data into one large CSV file.
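The combining step can be sketched as follows. This is a minimal illustration, assuming the raw data sits in one directory as CSV files that all share the same columns; the function name `combine_csv_files` and the file layout are assumptions, not the project's actual script.

```python
import csv
from pathlib import Path


def combine_csv_files(input_dir, output_path):
    """Concatenate every CSV in input_dir into one file with a single header.

    Assumes all files share the same columns. Returns the number of data rows written.
    """
    output_path = Path(output_path)
    # Exclude the output file itself in case it lives in the same directory.
    files = sorted(
        p for p in Path(input_dir).glob("*.csv")
        if p.resolve() != output_path.resolve()
    )
    header = None
    writer = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        for path in files:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)
                if reader.fieldnames is None:
                    continue  # skip empty files
                if writer is None:
                    header = reader.fieldnames
                    writer = csv.DictWriter(out, fieldnames=header)
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(f"{path.name} has different columns")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Failing loudly on mismatched columns is a deliberate choice here: silently merging files with different fields would produce misaligned rows in the combined data.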

Data Cleaning

Peer Comments

Are you planning to use web scraping to collect data?

Yes, we plan to collect the data through web scraping, using either the Reddit API or the web scraping techniques we learned in this class.

What is the format of your data and how do you plan to break it down for further analysis?

We will extract the data in CSV format from the Reddit API. Because we are dealing with unstructured text data, we will perform cleaning and preprocessing steps such as removing URLs, punctuation, and emojis; lowercasing; removing stop words; lemmatization; and removing other non-meaningful characters. We will then proceed to further analysis with advanced topic modeling techniques.
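Most of the cleaning steps listed above can be sketched with the standard library alone. The function name `clean_post` and the short stop-word list are assumptions for illustration; lemmatization is left out because it needs an external NLP library, and this version also drops digits along with punctuation and emojis.

```python
import re

# Small illustrative stop-word list (an assumption); a real run would use a fuller list.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "to", "of", "and",
    "in", "on", "for", "it", "this", "that",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_post(text):
    """Return a list of cleaned tokens for one post."""
    text = URL_RE.sub(" ", text)         # remove URLs first, before punctuation is stripped
    text = text.lower()                  # lowercase
    text = NON_LETTER_RE.sub(" ", text)  # drop punctuation, emojis, digits, other symbols
    # Remove stop words and single-letter leftovers (e.g. the "s" from "it's").
    return [t for t in text.split() if t not in STOP_WORDS and len(t) > 1]


if __name__ == "__main__":
    print(clean_post("GME to the MOON 🚀🚀!!! https://example.com/x"))
```

URLs are removed before punctuation so that a link is dropped as a whole rather than broken into fragments such as "https" and "com".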

What are the document-Level variables you are planning to use?

We plan to use the columns date, score, upvotes, up_ratio, total_awards, and comments as document-level variables and to analyze the impact of each one.

Relative Frequency Analysis, Word Cloud

Number of topics

We used the following advanced models:

CTM

STM

CTM vs STM

KeyATM Base vs KeyATM Covariates

Comparison with Secondary Data: CNBC Headlines

We have news headlines from 2018, so we compared the Reddit API data with the news headlines for 2018, when AMD was the most-trending stock. The comparison therefore focuses primarily on AMD.
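Restricting the headlines to the comparison window might look like the sketch below. It assumes each headline is available as a (date, text) pair and matches only the ticker "AMD" as a whole word; the function name `amd_headlines_2018` and the input shape are hypothetical, not the layout of the actual CNBC data.

```python
import re
from datetime import date

# Match the ticker as a whole word so words that merely contain "AMD" are not caught.
AMD_RE = re.compile(r"\bAMD\b")


def amd_headlines_2018(headlines):
    """Keep (date, text) pairs dated in 2018 whose text mentions AMD."""
    return [(d, t) for d, t in headlines if d.year == 2018 and AMD_RE.search(t)]


if __name__ == "__main__":
    sample = [
        (date(2018, 7, 26), "AMD shares surge after earnings beat"),
        (date(2019, 1, 3), "AMD unveils new chips"),
        (date(2018, 3, 1), "Markets slide on tariff fears"),
    ]
    print(amd_headlines_2018(sample))
```

The same filter can be applied to the Reddit posts so that both sides of the comparison cover the same year and stock.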