10/3/2021

Recap

  • Get data from Reddit API as well as Kaggle

  • Gain insights and follow the trends on the stock market using the wallstreetbets subreddit

  • Perform quantitative analysis (using tokens, DFM, and various quanteda tools)

  • Compare different approaches: Correlated Topic Model (CTM), Structural Topic Model (STM), Keyword Assisted Topic Model (keyATM), and Biterm Topic Model (BTM)

Peer comments

  • What type of data do you plan to get from Kaggle? How do you plan to limit the data?
    Our initial plan was to fall back on Kaggle data in case the Reddit API limited us. However, we were able to extract 142,737 ‘documents’ (comments) for the past month using the API alone, so we did not include Kaggle data.

  • What specific models will you use?
    We ended up using CTM, STM, and the Keyword Assisted Topic Model (keyATM) for this project.

  • Is the group focusing on a specific category of stocks? If yes, which category would it be?
    We focused on the stocks most discussed by the community in the last month.

Data Summary and Exploration

The primary source of our data is the Reddit API. We extracted the top posts from the wallstreetbets subreddit for the past month to get insight into the stocks trending among the Reddit community.

We then extracted all the comments from each post, giving a dataset of 142,737 comments.
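The comment extraction step can be sketched as follows. This is a minimal illustration, not the code used in the project: it assumes the comments arrive in Reddit's nested JSON listing format (a "Listing" whose children of kind "t1" are comments, each with an optional nested "replies" listing). The function name iter_comment_bodies and the toy thread are hypothetical.

```python
from typing import Iterator


def iter_comment_bodies(listing: dict) -> Iterator[str]:
    """Yield the text of every comment in a Reddit-style listing, depth first."""
    for child in listing.get("data", {}).get("children", []):
        # "t1" marks a comment; other kinds (e.g. "more" stubs) carry no text.
        if child.get("kind") != "t1":
            continue
        data = child["data"]
        yield data["body"]
        replies = data.get("replies")
        # Reddit uses an empty string, not an empty listing, for "no replies".
        if isinstance(replies, dict):
            yield from iter_comment_bodies(replies)


# Hypothetical toy thread for illustration only (not real data).
sample_thread = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {"body": "GME to the moon", "replies": {
            "kind": "Listing",
            "data": {"children": [
                {"kind": "t1", "data": {"body": "Holding CLOV too", "replies": ""}},
            ]},
        }}},
        {"kind": "more", "data": {"children": ["abc123"]}},
        {"kind": "t1", "data": {"body": "What about IRNT?", "replies": ""}},
    ]},
}

documents = list(iter_comment_bodies(sample_thread))
print(documents)
```

Each yielded string is treated as one 'document' for the later analysis, so a reply counts the same as a top-level comment.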

By looking at the top features in the topics of the posts, we found that the top 5 stocks discussed among the Reddit community in the past month were CLOV (Clover Health Investments Corp), SDC (SmileDirectClub Inc), GME (GameStop Corp.), IRNT (IronNet Inc), and EVERGRANDE.
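To make "top features" concrete, here is a rough sketch of building a document-feature matrix (DFM) and ranking features by total frequency. The project used quanteda for this; the Python below is only a stand-in under simplifying assumptions (lowercasing, letters-only tokens, a tiny illustrative stopword list), and the helper names and toy documents are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "to", "is", "and", "i", "on", "my", "of"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_dfm(docs):
    """Return one feature-count row (a Counter) per document."""
    return [Counter(tokenize(d)) for d in docs]


def top_features(dfm, n):
    """Sum feature counts over all documents and return the n most frequent."""
    totals = Counter()
    for row in dfm:
        totals.update(row)
    return totals.most_common(n)


# Hypothetical toy documents for illustration only.
docs = [
    "GME is going to the moon",
    "Holding GME and CLOV",
    "CLOV squeeze on the way, GME too",
]
dfm = build_dfm(docs)
print(top_features(dfm, 2))
```

On a corpus dominated by ticker talk, the ticker symbols themselves tend to surface near the top of such a ranking, which is how the stock list above was read off.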

The total number of comments related to each stock trending in the last month

GME was the most discussed stock. The recent Evergrande debt crisis was also discussed in the community.
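A per-stock comment count like the one summarized above could be computed along these lines. This is a sketch under stated assumptions: a comment counts once per ticker if it mentions the ticker as a whole word, case-insensitively. The function name count_mentions and the toy comments are hypothetical, and the real counting rule used in the project is not given in these notes.

```python
import re
from collections import Counter

TICKERS = ["CLOV", "SDC", "GME", "IRNT", "EVERGRANDE"]


def count_mentions(comments, tickers=TICKERS):
    """Count how many comments mention each ticker at least once."""
    # Whole-word, case-insensitive match so "gme" and "GME" both count.
    patterns = {
        t: re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE) for t in tickers
    }
    counts = Counter({t: 0 for t in tickers})
    for comment in comments:
        for ticker, pattern in patterns.items():
            if pattern.search(comment):
                counts[ticker] += 1
    return counts


# Hypothetical toy comments for illustration only.
comments = [
    "GME and gme again",
    "Evergrande debt is scary",
    "CLOV or GME?",
    "Nothing relevant here",
]
counts = count_mentions(comments)
print(counts.most_common())
```

Case-insensitive matching is safe for these five symbols, but it would overcount tickers that double as ordinary English words, so that choice would need revisiting for a broader stock list.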

Topic Modeling

Identifying some topics from all the comments posted by users

Correlated Topic Model - Number of topics = 5

Structural Topic Model (max. iteration 20 to limit runtime)

Number of topics = 5; Document variable = date

Both CTM and STM gave topics with similar keywords

Comparing topic quality in STM and CTM

Overall, STM has slightly higher semantic coherence and exclusivity than CTM. This suggests that, with only 5 topics, STM performs better than CTM.
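For readers unfamiliar with semantic coherence, the sketch below shows one common way to compute it. It assumes the co-occurrence definition of Mimno et al. (2011): for a topic's top words ordered by probability, sum log((D(w_i, w_j) + 1) / D(w_j)) over all pairs with j before i, where D counts documents containing the given words. The notes do not say exactly how the scores were obtained, so treat this as an illustration; exclusivity is not covered here. The function name and toy corpus are hypothetical.

```python
import math


def semantic_coherence(top_words, tokenized_docs):
    """Co-occurrence-based coherence of one topic; values closer to 0 are better."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                raise ValueError(f"{top_words[j]!r} does not occur in the corpus")
            d_ij = doc_freq(top_words[i], top_words[j])
            score += math.log((d_ij + 1) / d_j)
    return score


# Hypothetical toy corpus for illustration only.
corpus = [
    ["buy", "gme", "hold"],
    ["buy", "gme"],
    ["sell", "market"],
    ["buy", "market"],
]
coherent = semantic_coherence(["buy", "gme"], corpus)    # words that co-occur
incoherent = semantic_coherence(["gme", "sell"], corpus)  # words that never co-occur
print(coherent, incoherent)
```

A topic whose top words appear together in the same comments scores higher (less negative) than one whose top words never co-occur, which is the sense in which STM's topics were judged slightly more coherent.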

Keyword Assisted Topic Model

Keywords:

Fitting a keyATM Base model with the keyword sets and allowing 3 no-keyword topics
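Before fitting a keyword-assisted model, it can help to check how much of the corpus each keyword set actually covers, since very rare keywords give the model little to anchor on. The sketch below is an assumption-laden illustration: the keyword sets shown are hypothetical placeholders (the project's actual keyword lists are not reproduced in these notes), as are the function name and toy documents. Only the setting of 3 no-keyword topics comes from the text above.

```python
from collections import Counter

# Hypothetical keyword sets for illustration only; NOT the project's keywords.
keywords = {
    "buying": ["buy", "calls", "long"],
    "selling": ["sell", "puts", "short"],
}
N_NO_KEYWORD_TOPICS = 3  # matches the setting described above


def keyword_share(tokenized_docs, keyword_sets):
    """Return, per keyword topic, the share of corpus tokens matching its keywords."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    if total == 0:
        return {topic: 0.0 for topic in keyword_sets}
    return {
        topic: sum(counts[w] for w in words) / total
        for topic, words in keyword_sets.items()
    }


# Hypothetical toy documents for illustration only.
docs = [["buy", "gme", "calls"], ["sell", "clov"], ["buy", "the", "dip"]]
shares = keyword_share(docs, keywords)
total_topics = len(keywords) + N_NO_KEYWORD_TOPICS
print(shares, total_topics)
```

The total number of topics the model estimates is the number of keyword topics plus the no-keyword topics, so with two placeholder keyword sets this toy setup would fit five topics.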

Conclusion:

With this analysis, we wanted to understand how WallStreetBets writers and readers talked about stocks during the last month. GameStop remains the most discussed stock among the Reddit community. We found topics related to buying, selling, and the stock market in general, which give insight into what the community is discussing.