Nathan Murzello, Narender Reddy Konuganti, Sai Charan Pappala and Idania Viton
9/12/2022
r/wallstreetbets is a community of investors who share insights and discuss stocks. It is interesting and influential in part because of its size: it had about 2 million members, and after the publicity from the GameStop stock boom its membership ballooned to 10.8 million.
We extracted the data using the RedditExtractoR API. We extracted the following fields with the API call.
Data Variables:
Because the RedditExtractoR package is limited, we used the Pushshift API to gather URLs for the top threads in each month of our time frame (2018-2020).
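Gathering threads month by month requires a start and end boundary for every month in the time frame. The following is a minimal sketch of how those boundaries could be generated as UTC epoch seconds. The function name `month_windows` is hypothetical, and the sketch does not call any API or assume particular request parameters.

```python
from datetime import datetime, timezone


def month_windows(start_year, end_year):
    """Yield (label, start_epoch, end_epoch) per calendar month; end is exclusive."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            start = datetime(year, month, 1, tzinfo=timezone.utc)
            # The first day of the following month is the exclusive end of the window.
            if month == 12:
                end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
            else:
                end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
            yield f"{year}-{month:02d}", int(start.timestamp()), int(end.timestamp())


# 36 windows covering January 2018 through December 2020.
windows = list(month_windows(2018, 2020))
```

Each window can then be used to bound one monthly query, so that the top threads are collected separately for every month.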
Parameters Used:
Once this is completed, we can combine our raw data into one large CSV file.
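As an illustration of the combining step, the sketch below concatenates every CSV file in a directory into a single file using only the Python standard library. It assumes one CSV per extraction batch, all with the same columns. The function name `combine_csv_files` and the file layout are hypothetical, not the project's actual code.

```python
import csv
from pathlib import Path


def combine_csv_files(input_dir, output_path):
    """Concatenate all CSVs in input_dir into output_path, writing the header once.

    Returns the number of data rows written.
    """
    output_path = Path(output_path)
    # Sort for a stable order and skip the output file if it lives in input_dir.
    files = sorted(
        p for p in Path(input_dir).glob("*.csv")
        if p.resolve() != output_path.resolve()
    )
    fieldnames = None
    writer = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        for path in files:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)
                if reader.fieldnames is None:
                    continue  # empty file, nothing to copy
                if writer is None:
                    fieldnames = reader.fieldnames
                    writer = csv.DictWriter(out, fieldnames=fieldnames)
                    writer.writeheader()
                elif reader.fieldnames != fieldnames:
                    raise ValueError(f"{path.name} has different columns")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Raising an error on mismatched columns is a deliberate choice here: silently merging files with different fields would produce missing values that are hard to trace later.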
Applied standard cleaning steps: removing duplicates, handling missing values, etc.
Removed punctuation, emojis, URLs, and Reddit formatting from the records.
Formatted the raw data to create the final document.
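The cleaning steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual pipeline. "Reddit formatting" is taken to mean Markdown links and r/ or u/ mentions, and emojis are removed by dropping all non-ASCII characters, which also drops accented letters and curly quotes. All names are hypothetical.

```python
import re
import string

# Markdown links such as [text](url): keep the link text, drop the target.
MD_LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Subreddit and user mentions such as r/wallstreetbets or /u/someone.
MENTION_RE = re.compile(r"(?<!\w)/?[ru]/\w+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Strip links, mentions, punctuation, and non-ASCII characters from one record."""
    text = MD_LINK_RE.sub(r"\1", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Simplification: removing non-ASCII characters removes emojis (and more).
    text = text.encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())


def clean_records(records):
    """Clean each record, then drop missing, empty, and duplicate results."""
    seen = set()
    cleaned = []
    for text in records:
        if text is None:
            continue  # missing value
        result = clean_text(text)
        if not result or result in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(result)
        cleaned.append(result)
    return cleaned
```

Duplicates are detected after cleaning, so two posts that differ only in punctuation or emojis count as the same record.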
Are you planning to do web scraping to collect data?
Yes, we are planning to use web scraping to collect the data, through either the Reddit API or the web scraping techniques we learned in this class.
What is the format of your data, and how do you plan to break it down for further analysis?
We will extract the data in CSV format from the Reddit API. Because we are dealing with unstructured data (text), we will perform cleaning and preprocessing steps such as removal of URLs, punctuation, and emojis; lower-casing; removal of stop words; lemmatization; and removal of other non-meaningful characters. We will then proceed to further analysis with advanced topic modelling techniques.
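A minimal sketch of the token-level part of this preprocessing (lower-casing and stop-word removal) is shown below. The stop-word list is a tiny illustrative sample rather than a full list. Lemmatization is left out because it would require an NLP library, which this sketch does not assume. The name `preprocess` is hypothetical.

```python
import re

# A tiny illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "it", "of", "on", "the", "this", "to"}
# Tokens are runs of lower-case letters and digits; everything else is a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into tokens, and drop stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words]


print(preprocess("AMD is going to the Moon in 2018"))
```

The resulting token lists are the kind of input that topic modelling tools typically expect, one list per document.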
What are the document-level variables you are planning to use?
We are planning to use the columns date, score, upvotes, up_ratio, total_awards, and comments as document-level variables and to analyze the impact of each one.
We have news headlines from 2018, so we ran a comparison analysis of the Reddit API data and the news headlines for that year. AMD was the most trending stock in 2018, so the comparison focuses primarily on AMD.