Numbers and symbols related to a trading market in orange, green, white and red

Part 1 - Introduction

Abstract

We’ve built a rudimentary hype index based on the conversations posted in the “r/WallStreetBets” subreddit (WSB). To do so we’ve scrapped conversations from WSB, analyzed for sentiment and calculated a metric for hype.

WSB became famous with the unprecedented stock surge for GameStop caused by viral trading. This phenomenon went on to benefit companies that benefited from the hype generated in their stock by WSB, included Hertz and AMC.

We are not addressing if there is any predictive value in applying sentiment analysis, but rather demonstrating a rudimentary measurement of hype surrounding a stock.


Motivation for doing this work

Completing similar social listening across multiple social media platforms could identify future viral trading phenomenons, or become a valuation component for stock analysts using hype to measure positive regard for a stock among retail investors.


Question we’re seeking to answer

Can we demonstrate a rudimentary measurement of hype surrounding a stock? What did we succeed in and what could be improved? How can we extend this measurement for future projects?


Where we’re sourcing our data

We’re using two primary data sources:


Overall project flow

Our project write up is organized by OSEMN (pronounced “awesome”), the acronym for Obtain, Scrub, Explore, Model, and iNterpret.


Roles and Responsibilities

Every member of the team attempted all parts of the project. We combined our efforts to complete the project.



Part 2 - Data Collection

Obtain Scrub Explore Model iNterpret

As an overview, we are collecting data from posts made to WSB. As well, we are taking information from the TTR package to generate a master list of ticker symbols to search for in the posts.


Libraries

  1. RedditExtractoR for scraping reddit posts
  2. ttr for quant trading info & ticker lists
  3. BatchGetSymbols for symbol look up
  4. Various Plotting tools
library(RedditExtractoR)
library(TTR)
library(curl)
library(tidyverse)
library(quanteda)
library(readr)

library(dplyr)

library(tm)

library(SnowballC)
library(BatchGetSymbols)

library(lubridate)

library(arsenal)

library(ggplot2)
library(plotly)
library(reshape2)

library(sentimentr)

Getting Started

RedditExtractoR Example

As an example of how else you can use the RedditExtractoR package, here’s code that allows you to extract all posts with the ticker symbol for Apple Stock, AAPL.

# RedditExtractoR example using AAPL         
z <- find_thread_urls(
  keywords = "AAPL",
  sort_by = "new",
  subreddit = "wallstreetbets",
  period = "month"
)
## parsing URLs on page 1...
# Pull the underlying comments to the posts
y <- get_thread_content(urls=z$url)


Start by building a codex of all ticker symbols listed on all exchanges.

Then we are going to get all subthreads on WSB indexed by Reddit, sorted by new and set for an hour duration.

Due to power limitations on some machines, here we will fetch only 500 posts.

 master <- TTR::stockSymbols(exchange = c("AMEX", "NASDAQ", "NYSE", "ARCA", "BATS", "IEX"))[,c('Name', 'Symbol')]
  all_posts <- find_thread_urls(subreddit="wallstreetbets", sort_by="new", period="hour")
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
## parsing URLs on page 4...
## parsing URLs on page 5...
## parsing URLs on page 6...
## parsing URLs on page 7...
## parsing URLs on page 8...
## parsing URLs on page 9...
## parsing URLs on page 10...
  slim_posts <- head(all_posts, n=500L)

Creating a corpus

Obtain Scrub Explore Model iNterpret

At this point we have our 500 indexed posts, and we can use them to create a corpus.

From there, we’re going to strip numbers and punctuation, and create a window around each of the ticker symbols found in the text.

 slim_posts$index <- 1:nrow(slim_posts)
  corp <- corpus(slim_posts, docid_field = "index", text_field = "text")
  x <- kwic(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE), 
            pattern = master$Symbol,
            window = 8, case_insensitive = FALSE,
  )
  x$index = x$docname
  add_In_Date <- slim_posts[c("index","date_utc")]
  rownames(add_In_Date) <- NULL
  target <- as.data.frame(x)
  target$sentence = paste(target$pre, target$post)
  target <- merge(target,add_In_Date,by="index")
  
 target$sentence <- str_replace_all(target$sentence, pattern = '[:punct:]',  replacement =" ")
 target

Sentiment Break ups

At this point we have relatively sanitized data and we will proceed to take the data, run both sides of the window through a sentiment determining routine, and then create a dataframe that has the sentiment, date, and ticker symbol.

target$sentiment <- sentiment(target$sentence)$sentiment

sentimentHolder <- target[c("keyword","date_utc", "sentiment")]
sentimentHolder = setNames(sentimentHolder, c("ticker", "ref.date","sentiment"))

sentimentHolder <- sentimentHolder %>% group_by(ticker, ref.date) %>% summarise(mean(sentiment))

sentimentHolderback <- sentimentHolder
sentimentHolder$ref.date = as.Date(sentimentHolder$ref.date)
sentimentHolder

Getting the price index

Fetch the price metrics of all ticker symbols enclosed in the data. Here, the time range is minus one day to plus two days. We also fetch the sentiment data during this time period.

max(sentimentHolder$ref.date)+2
## [1] "2022-05-11"
min(sentimentHolder$ref.date)-1
## [1] "2022-05-01"
l.out <- BatchGetSymbols(tickers = unique(sentimentHolder$ticker),
                         first.date = as.Date(min(sentimentHolder$ref.date)),
                        last.date = min(as.Date(max(sentimentHolder$ref.date)+2,Sys.Date())), do.cache=FALSE)
glimpse(l.out)
## List of 2
##  $ df.control: tibble [201 x 6] (S3: tbl_df/tbl/data.frame)
##   ..$ ticker              : chr [1:201] "A" "AAPL" "ABNB" "ADM" ...
##   ..$ src                 : chr [1:201] "yahoo" "yahoo" "yahoo" "yahoo" ...
##   ..$ download.status     : chr [1:201] "OK" "OK" "OK" "OK" ...
##   ..$ total.obs           : int [1:201] 7 7 7 7 7 7 7 7 7 7 ...
##   ..$ perc.benchmark.dates: num [1:201] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ threshold.decision  : chr [1:201] "KEEP" "KEEP" "KEEP" "KEEP" ...
##  $ df.tickers:'data.frame':  1407 obs. of  10 variables:
##   ..$ price.open         : num [1:1407] 118 120 122 125 121 ...
##   ..$ price.high         : num [1:1407] 120 124 127 125 122 ...
##   ..$ price.low          : num [1:1407] 116 119 121 121 118 ...
##   ..$ price.close        : num [1:1407] 120 122 126 122 121 ...
##   ..$ volume             : num [1:1407] 1756200 3217800 2460400 2123600 1629000 ...
##   ..$ price.adjusted     : num [1:1407] 120 122 126 122 121 ...
##   ..$ ref.date           : Date[1:1407], format: "2022-05-02" "2022-05-03" ...
##   ..$ ticker             : chr [1:1407] "A" "A" "A" "A" ...
##   ..$ ret.adjusted.prices: num [1:1407] NA 0.02367 0.03186 -0.03286 -0.00876 ...
##   ..$ ret.closing.prices : num [1:1407] NA 0.02367 0.03186 -0.03286 -0.00876 ...
priceData <- l.out$df.tickers 

Part 3 - View Data

Obtain Scrub Explore Model iNterpret

Looking in

A glimpse of the sentiment data and the price data:

head(priceData, n=5L)
head(sentimentHolder, n=5L)

The Merge

Merge the sentiment with the pricing data.

To ensure that we didn’t create any data integrity issues, we will also create a summary.

mergedData <- merge(priceData, sentimentHolder, all.x = TRUE)
mergedData$`mean(sentiment)`[is.na(mergedData$`mean(sentiment)`)] <- 0
mergedData$sentiment <- mergedData$`mean(sentiment)`
summary(comparedf(mergedData, priceData, by = "ticker"))
## 
## 
## Table: Summary of data.frames
## 
## version   arg           ncol   nrow
## --------  -----------  -----  -----
## x         mergedData      12   1407
## y         priceData       10   1407
## 
## 
## 
## Table: Summary of overall comparison
## 
## statistic                                                      value
## ------------------------------------------------------------  ------
## Number of by-variables                                             1
## Number of non-by variables in common                               9
## Number of variables compared                                       9
## Number of variables in x but not y                                 2
## Number of variables in y but not x                                 0
## Number of variables compared with some values unequal              9
## Number of variables compared with all values equal                 0
## Number of observations in common                                9849
## Number of observations in x but not y                              0
## Number of observations in y but not x                              0
## Number of observations with some compared variables unequal     1407
## Number of observations with all compared variables equal        8442
## Number of values unequal                                       75426
## 
## 
## 
## Table: Variables not shared
## 
## version   variable           position  class   
## --------  ----------------  ---------  --------
## x         mean(sentiment)          11  numeric 
## x         sentiment                12  numeric 
## 
## 
## 
## Table: Other variables not compared
## 
##                                  
##  --------------------------------
##  No other variables not compared 
##  --------------------------------
## 
## 
## 
## Table: Observations not shared
## 
##                             
##  ---------------------------
##  No observations not shared 
##  ---------------------------
## 
## 
## 
## Table: Differences detected by variable
## 
## var.x                 var.y                     n    NAs
## --------------------  --------------------  -----  -----
## ref.date              ref.date               8442      0
## price.open            price.open             8336      0
## price.high            price.high             8362      0
## price.low             price.low              8328      0
## price.close           price.close            8364      0
## volume                volume                 8402      0
## price.adjusted        price.adjusted         8364      0
## ret.adjusted.prices   ret.adjusted.prices    8414   2412
## ret.closing.prices    ret.closing.prices     8414   2412
## 
## 
## 
## Table: Differences detected (75376 not shown)
## 
## var.x         var.y         ticker   values.x     values.y      row.x   row.y
## ------------  ------------  -------  -----------  -----------  ------  ------
## ref.date      ref.date      A        2022-05-02   2022-05-03        1       2
## ref.date      ref.date      A        2022-05-02   2022-05-04        1       3
## ref.date      ref.date      A        2022-05-02   2022-05-05        1       4
## ref.date      ref.date      A        2022-05-02   2022-05-06        1       5
## ref.date      ref.date      A        2022-05-02   2022-05-09        1       6
## ref.date      ref.date      A        2022-05-02   2022-05-10        1       7
## ref.date      ref.date      A        2022-05-03   2022-05-02      202       1
## ref.date      ref.date      A        2022-05-03   2022-05-04      202       3
## ref.date      ref.date      A        2022-05-03   2022-05-05      202       4
## ref.date      ref.date      A        2022-05-03   2022-05-06      202       5
## price.open    price.open    A        118.38       119.72            1       2
## price.open    price.open    A        118.38       121.69            1       3
## price.open    price.open    A        118.38       124.62            1       4
## price.open    price.open    A        118.38       120.88            1       5
## price.open    price.open    A        118.38       119.13            1       6
## price.open    price.open    A        118.38       115.44            1       7
## price.open    price.open    A        119.72       118.38          202       1
## price.open    price.open    A        119.72       121.69          202       3
## price.open    price.open    A        119.72       124.62          202       4
## price.open    price.open    A        119.72       120.88          202       5
## price.high    price.high    A        120.34       123.98            1       2
## price.high    price.high    A        120.34       126.69            1       3
## price.high    price.high    A        120.34       125.21            1       4
## price.high    price.high    A        120.34       121.88            1       5
## price.high    price.high    A        120.34       119.13            1       6
## price.high    price.high    A        120.34       117.94            1       7
## price.high    price.high    A        123.98       120.34          202       1
## price.high    price.high    A        123.98       126.69          202       3
## price.high    price.high    A        123.98       125.21          202       4
## price.high    price.high    A        123.98       121.88          202       5
## price.low     price.low     A        116.49       119.09            1       2
## price.low     price.low     A        116.49       121.44            1       3
## price.low     price.low     A        116.49       120.8             1       4
## price.low     price.low     A        116.49       118               1       5
## price.low     price.low     A        116.49       112.64            1       6
## price.low     price.low     A        116.49       113.14            1       7
## price.low     price.low     A        119.09       116.49          202       1
## price.low     price.low     A        119.09       121.44          202       3
## price.low     price.low     A        119.09       120.8           202       4
## price.low     price.low     A        119.09       118             202       5
## price.close   price.close   A        119.57       122.4             1       2
## price.close   price.close   A        119.57       126.3             1       3
## price.close   price.close   A        119.57       122.15            1       4
## price.close   price.close   A        119.57       121.08            1       5
## price.close   price.close   A        119.57       113.11            1       6
## price.close   price.close   A        119.57       116.64            1       7
## price.close   price.close   A        122.4        119.57          202       1
## price.close   price.close   A        122.4        126.3           202       3
## price.close   price.close   A        122.4        122.15          202       4
## price.close   price.close   A        122.4        121.08          202       5
## 
## 
## 
## Table: Non-identical attributes
## 
##                              
##  ----------------------------
##  No non-identical attributes 
##  ----------------------------
head(mergedData, n=5L)

Part 4 - Plotting

Obtain Scrub Explore Model iNterpret

To help us visualize, create a basic plot of the sentiment measures.

dateSentiment <- mergedData[c("ref.date","ticker", "sentiment")]
t <- ggplot(dateSentiment[dateSentiment$sentiment != 0,]) + aes(x=ref.date, y=sentiment, color = ticker) + geom_boxplot() + stat_summary(fun = "median",
               geom = "point",
               color = "Orange") +               
              stat_summary(fun = "mean",
               geom = "point", 
               colour = "red")

plot1 <- ggplotly(t)
# fig <- subplot(plot1, plot1, nrows = 2, shareX = TRUE) %>%  layout(hovermode = "x unified")
# fig
plot1

Overall Plot

Combine the ticker prices and sentiment into one workspace.

Incorporate price.high, price.low, and associated sentiment as subplots that share the date as the x-axis.

datePrice <- mergedData[c("ref.date","ticker", "price.open", "price.close")]
plot2 <- ggplotly(ggplot(datePrice, aes(x = ref.date, y = price.open, colour = ticker)) +
  geom_line(show.legend=FALSE))

plot3 <- ggplotly(ggplot(datePrice, aes(x = ref.date, y = price.close, colour = ticker), show.legend = FALSE) + 
  geom_line(show.legend = FALSE)) 
  
fig <- subplot(plot1, plot2, plot3, nrows = 3, shareX = TRUE) %>%  layout(hovermode = "x unified")
fig

Part 5 - Conclusion

Obtain Scrub Explore Model iNterpret

Ultimately we were able to generate a rudimentary measure of hype by applying sentiment analysis to the WSB posts.

Avenues for future improvement could be to refine the measure of hype into a score for any given day based on all of the activity for that day. This would lend itself to a time series analysis by price, or price offset by total market performance or market sector performance.

One way to extend this measurement to future projects would be to assess how well the sentiment analysis labeled the posts. For instance a manual labeling of posts along with a document term matrix could identify additional phrases, like ‘to the moon’, or ‘stonk’, not recognized by standard sentiment packages.