We’ve built a rudimentary hype index based on the conversations posted in the “r/WallStreetBets” subreddit (WSB). To do so, we scraped conversations from WSB, analyzed them for sentiment, and calculated a metric for hype.
WSB became famous for the unprecedented surge in GameStop stock caused by viral trading. Other companies, including Hertz and AMC, went on to benefit from the hype WSB generated around their stock.
We are not addressing whether sentiment analysis has any predictive value, but rather demonstrating a rudimentary measurement of the hype surrounding a stock.
Similar social listening across multiple social media platforms could identify future viral trading phenomena, or become a valuation component for stock analysts using hype to measure positive regard for a stock among retail investors.
Can we demonstrate a rudimentary measurement of hype surrounding a stock? Where did we succeed, and what could be improved? How can we extend this measurement for future projects?
We’re using two primary data sources: posts made to WSB, and the master list of ticker symbols provided by the TTR package.
Our project write-up is organized by OSEMN (pronounced “awesome”), the acronym for Obtain, Scrub, Explore, Model, and iNterpret.
Every member of the team attempted all parts of the project, and we combined our efforts to complete it.
Obtain
As an overview, we collect data from posts made to WSB, and we use the TTR package to generate a master list of ticker symbols to search for in the posts.
library(RedditExtractoR)  # scrape Reddit threads and comments
library(TTR)              # master list of ticker symbols
library(curl)
library(tidyverse)
library(quanteda)         # corpus construction and keyword-in-context windows
library(readr)
library(dplyr)
library(tm)               # text mining utilities
library(SnowballC)        # word stemming
library(BatchGetSymbols)  # fetch historical price data
library(lubridate)        # date handling
library(arsenal)          # comparedf() for data integrity checks
library(ggplot2)
library(plotly)           # interactive plots and subplots
library(reshape2)
library(sentimentr)       # sentence-level sentiment scoring
As an example of how else you can use the RedditExtractoR package, here’s code that extracts all recent posts mentioning the ticker symbol for Apple stock, AAPL.
# RedditExtractoR example using AAPL
z <- find_thread_urls(
  keywords = "AAPL",
  sort_by = "new",
  subreddit = "wallstreetbets",
  period = "month"
)
## parsing URLs on page 1...
# Pull the underlying comments to the posts
y <- get_thread_content(urls=z$url)
Start by building a master list of all ticker symbols listed on the major exchanges.
Then we fetch all threads on WSB indexed by Reddit, sorted by newest, over a one-hour window.
Due to power limitations on some machines, we keep only the first 500 posts.
master <- TTR::stockSymbols(exchange = c("AMEX", "NASDAQ", "NYSE", "ARCA", "BATS", "IEX"))[,c('Name', 'Symbol')]
all_posts <- find_thread_urls(subreddit="wallstreetbets", sort_by="new", period="hour")
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
## parsing URLs on page 4...
## parsing URLs on page 5...
## parsing URLs on page 6...
## parsing URLs on page 7...
## parsing URLs on page 8...
## parsing URLs on page 9...
## parsing URLs on page 10...
slim_posts <- head(all_posts, n=500L)
Scrub
At this point we have our 500 indexed posts, and we can use them to create a corpus.
From there, we strip numbers and punctuation and extract a window of eight tokens around each ticker symbol found in the text.
slim_posts$index <- 1:nrow(slim_posts)
corp <- corpus(slim_posts, docid_field = "index", text_field = "text")
x <- kwic(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE),
          pattern = master$Symbol,
          window = 8, case_insensitive = FALSE)
x$index = x$docname
add_In_Date <- slim_posts[c("index","date_utc")]
rownames(add_In_Date) <- NULL
target <- as.data.frame(x)
target$sentence = paste(target$pre, target$post)
target <- merge(target,add_In_Date,by="index")
target$sentence <- str_replace_all(target$sentence, pattern = '[[:punct:]]', replacement = " ")
target
At this point we have relatively sanitized data. Next we run both sides of each window through a sentiment-scoring routine and build a data frame holding the sentiment, date, and ticker symbol.
target$sentiment <- sentiment(target$sentence)$sentiment
sentimentHolder <- target[c("keyword","date_utc", "sentiment")]
sentimentHolder = setNames(sentimentHolder, c("ticker", "ref.date","sentiment"))
sentimentHolder <- sentimentHolder %>% group_by(ticker, ref.date) %>% summarise(mean(sentiment))
sentimentHolderback <- sentimentHolder
sentimentHolder$ref.date = as.Date(sentimentHolder$ref.date)
sentimentHolder
Fetch the price metrics of all ticker symbols found in the data. The range runs from the first sentiment date to two days after the last, capped at today’s date. First, preview the candidate bounds:
max(sentimentHolder$ref.date)+2
## [1] "2022-05-11"
min(sentimentHolder$ref.date)-1
## [1] "2022-05-01"
l.out <- BatchGetSymbols(tickers = unique(sentimentHolder$ticker),
                         first.date = as.Date(min(sentimentHolder$ref.date)),
                         last.date = min(max(sentimentHolder$ref.date) + 2, Sys.Date()),
                         do.cache = FALSE)
glimpse(l.out)
## List of 2
## $ df.control: tibble [201 x 6] (S3: tbl_df/tbl/data.frame)
## ..$ ticker : chr [1:201] "A" "AAPL" "ABNB" "ADM" ...
## ..$ src : chr [1:201] "yahoo" "yahoo" "yahoo" "yahoo" ...
## ..$ download.status : chr [1:201] "OK" "OK" "OK" "OK" ...
## ..$ total.obs : int [1:201] 7 7 7 7 7 7 7 7 7 7 ...
## ..$ perc.benchmark.dates: num [1:201] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ threshold.decision : chr [1:201] "KEEP" "KEEP" "KEEP" "KEEP" ...
## $ df.tickers:'data.frame': 1407 obs. of 10 variables:
## ..$ price.open : num [1:1407] 118 120 122 125 121 ...
## ..$ price.high : num [1:1407] 120 124 127 125 122 ...
## ..$ price.low : num [1:1407] 116 119 121 121 118 ...
## ..$ price.close : num [1:1407] 120 122 126 122 121 ...
## ..$ volume : num [1:1407] 1756200 3217800 2460400 2123600 1629000 ...
## ..$ price.adjusted : num [1:1407] 120 122 126 122 121 ...
## ..$ ref.date : Date[1:1407], format: "2022-05-02" "2022-05-03" ...
## ..$ ticker : chr [1:1407] "A" "A" "A" "A" ...
## ..$ ret.adjusted.prices: num [1:1407] NA 0.02367 0.03186 -0.03286 -0.00876 ...
## ..$ ret.closing.prices : num [1:1407] NA 0.02367 0.03186 -0.03286 -0.00876 ...
priceData <- l.out$df.tickers
Explore
A glimpse of the sentiment data and the price data:
head(priceData, n=5L)
head(sentimentHolder, n=5L)
Merge the sentiment with the pricing data.
To check that the merge didn’t create any data integrity issues, we also produce a comparison summary with comparedf(). Note that because ticker alone is not a unique key, comparedf() matches every row for a ticker in one data frame against every row for that ticker in the other, so most of the “differences” reported below are matching artifacts rather than corrupted values.
mergedData <- merge(priceData, sentimentHolder, all.x = TRUE)
mergedData$`mean(sentiment)`[is.na(mergedData$`mean(sentiment)`)] <- 0
mergedData$sentiment <- mergedData$`mean(sentiment)`
summary(comparedf(mergedData, priceData, by = "ticker"))
##
##
## Table: Summary of data.frames
##
## version arg ncol nrow
## -------- ----------- ----- -----
## x mergedData 12 1407
## y priceData 10 1407
##
##
##
## Table: Summary of overall comparison
##
## statistic value
## ------------------------------------------------------------ ------
## Number of by-variables 1
## Number of non-by variables in common 9
## Number of variables compared 9
## Number of variables in x but not y 2
## Number of variables in y but not x 0
## Number of variables compared with some values unequal 9
## Number of variables compared with all values equal 0
## Number of observations in common 9849
## Number of observations in x but not y 0
## Number of observations in y but not x 0
## Number of observations with some compared variables unequal 1407
## Number of observations with all compared variables equal 8442
## Number of values unequal 75426
##
##
##
## Table: Variables not shared
##
## version variable position class
## -------- ---------------- --------- --------
## x mean(sentiment) 11 numeric
## x sentiment 12 numeric
##
##
##
## Table: Other variables not compared
##
##
## --------------------------------
## No other variables not compared
## --------------------------------
##
##
##
## Table: Observations not shared
##
##
## ---------------------------
## No observations not shared
## ---------------------------
##
##
##
## Table: Differences detected by variable
##
## var.x var.y n NAs
## -------------------- -------------------- ----- -----
## ref.date ref.date 8442 0
## price.open price.open 8336 0
## price.high price.high 8362 0
## price.low price.low 8328 0
## price.close price.close 8364 0
## volume volume 8402 0
## price.adjusted price.adjusted 8364 0
## ret.adjusted.prices ret.adjusted.prices 8414 2412
## ret.closing.prices ret.closing.prices 8414 2412
##
##
##
## Table: Differences detected (75376 not shown)
##
## var.x var.y ticker values.x values.y row.x row.y
## ------------ ------------ ------- ----------- ----------- ------ ------
## ref.date ref.date A 2022-05-02 2022-05-03 1 2
## ref.date ref.date A 2022-05-02 2022-05-04 1 3
## ref.date ref.date A 2022-05-02 2022-05-05 1 4
## ref.date ref.date A 2022-05-02 2022-05-06 1 5
## ref.date ref.date A 2022-05-02 2022-05-09 1 6
## ref.date ref.date A 2022-05-02 2022-05-10 1 7
## ref.date ref.date A 2022-05-03 2022-05-02 202 1
## ref.date ref.date A 2022-05-03 2022-05-04 202 3
## ref.date ref.date A 2022-05-03 2022-05-05 202 4
## ref.date ref.date A 2022-05-03 2022-05-06 202 5
## price.open price.open A 118.38 119.72 1 2
## price.open price.open A 118.38 121.69 1 3
## price.open price.open A 118.38 124.62 1 4
## price.open price.open A 118.38 120.88 1 5
## price.open price.open A 118.38 119.13 1 6
## price.open price.open A 118.38 115.44 1 7
## price.open price.open A 119.72 118.38 202 1
## price.open price.open A 119.72 121.69 202 3
## price.open price.open A 119.72 124.62 202 4
## price.open price.open A 119.72 120.88 202 5
## price.high price.high A 120.34 123.98 1 2
## price.high price.high A 120.34 126.69 1 3
## price.high price.high A 120.34 125.21 1 4
## price.high price.high A 120.34 121.88 1 5
## price.high price.high A 120.34 119.13 1 6
## price.high price.high A 120.34 117.94 1 7
## price.high price.high A 123.98 120.34 202 1
## price.high price.high A 123.98 126.69 202 3
## price.high price.high A 123.98 125.21 202 4
## price.high price.high A 123.98 121.88 202 5
## price.low price.low A 116.49 119.09 1 2
## price.low price.low A 116.49 121.44 1 3
## price.low price.low A 116.49 120.8 1 4
## price.low price.low A 116.49 118 1 5
## price.low price.low A 116.49 112.64 1 6
## price.low price.low A 116.49 113.14 1 7
## price.low price.low A 119.09 116.49 202 1
## price.low price.low A 119.09 121.44 202 3
## price.low price.low A 119.09 120.8 202 4
## price.low price.low A 119.09 118 202 5
## price.close price.close A 119.57 122.4 1 2
## price.close price.close A 119.57 126.3 1 3
## price.close price.close A 119.57 122.15 1 4
## price.close price.close A 119.57 121.08 1 5
## price.close price.close A 119.57 113.11 1 6
## price.close price.close A 119.57 116.64 1 7
## price.close price.close A 122.4 119.57 202 1
## price.close price.close A 122.4 126.3 202 3
## price.close price.close A 122.4 122.15 202 4
## price.close price.close A 122.4 121.08 202 5
##
##
##
## Table: Non-identical attributes
##
##
## ----------------------------
## No non-identical attributes
## ----------------------------
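Since the by-variable above is not unique per row, a stricter check would key on both ticker and date so that rows match one-to-one. A minimal sketch, using the same data frames as above; this should flag only the two sentiment columns added by the merge:
# Stricter integrity check: a (ticker, ref.date) key matches rows one-to-one
# instead of many-to-many within each ticker.
summary(comparedf(mergedData, priceData, by = c("ticker", "ref.date")))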
head(mergedData, n=5L)
Model
To help us visualize, create a basic plot of the sentiment measures.
dateSentiment <- mergedData[c("ref.date","ticker", "sentiment")]
t <- ggplot(dateSentiment[dateSentiment$sentiment != 0, ]) +
  aes(x = ref.date, y = sentiment, color = ticker) +
  geom_boxplot() +
  stat_summary(fun = "median", geom = "point", colour = "orange") +
  stat_summary(fun = "mean", geom = "point", colour = "red")
plot1 <- ggplotly(t)
plot1
Combine the ticker prices and sentiment into one workspace.
Incorporate price.open, price.close, and the associated sentiment as subplots that share the date as the x-axis.
datePrice <- mergedData[c("ref.date", "ticker", "price.open", "price.close")]
plot2 <- ggplotly(ggplot(datePrice, aes(x = ref.date, y = price.open, colour = ticker)) +
                    geom_line(show.legend = FALSE))
plot3 <- ggplotly(ggplot(datePrice, aes(x = ref.date, y = price.close, colour = ticker)) +
                    geom_line(show.legend = FALSE))
fig <- subplot(plot1, plot2, plot3, nrows = 3, shareX = TRUE) %>% layout(hovermode = "x unified")
fig
iNterpret
Ultimately we were able to generate a rudimentary measure of hype by applying sentiment analysis to the WSB posts.
One avenue for future improvement would be to refine the measure of hype into a score for any given day, based on all of the activity for that day. That would lend itself to time-series analysis against price, or against price offset by total market or sector performance.
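As a sketch of what that daily score could look like, here is one possible aggregation over the target data frame from the Scrub step (one row per ticker mention). Weighting mean sentiment by log1p() of the mention count is our own assumption, not an established metric:
# Hypothetical daily hype score: average sentiment of a ticker's mentions,
# weighted by the (log-damped) number of mentions that day.
hype_index <- target %>%
  group_by(ticker = keyword, ref.date = as.Date(date_utc)) %>%
  summarise(
    mentions       = n(),                              # mention volume
    mean_sentiment = mean(sentiment),                  # average tone
    hype           = mean_sentiment * log1p(mentions), # volume-weighted tone
    .groups = "drop"
  )
head(hype_index)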
One way to extend this measurement in future projects would be to assess how well the sentiment analysis labeled the posts. For instance, manually labeling posts, combined with a document-term matrix, could identify additional phrases, like ‘to the moon’ or ‘stonk’, that standard sentiment packages don’t recognize.
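For example, sentimentr allows extending its default polarity lexicon, so terms surfaced by a labeling pass could be scored directly. A minimal sketch, where the slang terms and their polarity values are illustrative assumptions rather than the output of an actual labeling exercise:
# Hypothetical lexicon extension: add WSB slang to sentimentr's default key.
# The terms and polarity values below are assumptions for illustration only.
wsb_key <- update_key(lexicon::hash_sentiment_jockers_rinker,
                      x = data.frame(x = c("stonk", "stonks", "tendies"),
                                     y = c(0.8, 0.8, 0.6)))
# Re-score the keyword windows with the extended lexicon
target$sentiment_wsb <- sentiment(target$sentence, polarity_dt = wsb_key)$sentiment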
One challenge we had with the code was that working on different platforms (Windows/Mac; Firefox/Chrome) meant our code was not always interchangeable. One way we could have addressed that would have been to do our initial coding in Docker containers.
Please also check out the Shiny app included with this project!