I follow specific social media sources for financial market information
in my daily life
This dashboard visualizes data analysis results for daily posts in the subreddit
“r/stocks” (https://www.reddit.com/r/stocks/) from
2022-10-25 to 2022-12-02
* Data collection:
– Reddit posts and comments are collected through the Reddit API using PRAW: https://praw.readthedocs.io/en/stable/
– SPY price data is collected from Yahoo Finance
* Data storage:
– Data is stored on my local computer in .csv format
* Data manipulation:
– The content of each post is cleaned and tokenized using NLP
techniques
* Data visualization:
– Mainly relies on the ggplot2 and wordcloud2 packages
Question 1:
I am interested in which specific terms were mentioned in the “r/stocks”
subreddit over the last two months
– By clicking a term in the word cloud, I can easily see its
frequency
– Key terms such as “Inflation”, “Buy”, “Growth”, and “Earnings” give me a
general picture of how people evaluate the current market
I am interested in the distribution of the number of comments under
each post
– Based on the histogram, most posts have fewer than 100
comments; to me, a post with more than 250
comments is relatively popular
I am interested in how the daily number of comments correlates with
the daily SPY return
– Based on the scatter plot, I see a positive correlation once the
daily number of comments exceeds 100
I am interested in how the number of comments compares across three
of my favorite stocks: GOOGL, AMZN, and TSLA
– It looks like GOOGL receives relatively more comments per post
than AMZN and TSLA
---
title: "ANLY512 - Final Project"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(quantmod)
library(plyr)
library(ggplot2)
library(dplyr)
library(ggalt)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(wordcloud2)
reddit_stock <- read.csv("submissions.csv")
SPY <- read.csv("SPY.csv")
reddit_comment_ticker <- read.csv("comment_tickers.csv")
cleanup <- theme(panel.grid.major = element_blank(), #no grid lines
                 panel.grid.minor = element_blank(), #no grid lines
                 panel.background = element_blank(), #no background
                 axis.line.x = element_line(color = 'black'), #black x axis line
                 axis.line.y = element_line(color = 'black'), #black y axis line
                 legend.key = element_rect(fill = 'white'), #no legend background
                 text = element_text(size = 15)) #bigger text size
```
## Interactive Word Cloud {.storyboard data-orientation=rows data-icon="fa-list" data-width=500}
### Summary {data-height=300 data-width=500}
I follow specific social media sources for financial market information in my daily life\
This dashboard visualizes data analysis results for daily posts in the subreddit "r/stocks" (https://www.reddit.com/r/stocks/) from 2022-10-25 to 2022-12-02\
\
* Data collection: \
-- Reddit posts and comments are collected through the Reddit API using PRAW: https://praw.readthedocs.io/en/stable/ \
-- SPY price data is collected from Yahoo Finance\
* Data storage:\
-- Data is stored on my local computer in .csv format\
* Data manipulation:\
-- The content of each post is cleaned and tokenized using NLP techniques\
* Data visualization:\
-- Mainly relies on the ggplot2 and wordcloud2 packages\
\
Question 1:\
I am interested in which specific terms were mentioned in the "r/stocks" subreddit over the last two months\
-- By clicking a term in the word cloud, I can easily see its frequency\
-- Key terms such as "Inflation", "Buy", "Growth", and "Earnings" give me a general picture of how people evaluate the current market
### Interactive Word Cloud {data-height=600 data-width=500}

```{r message=FALSE, fig.height=8, fig.width=15}
RedditContent <- reddit_stock$content
# Build a corpus with one document per post
docs <- Corpus(VectorSource(RedditContent))
# Clean the text: drop numbers and punctuation, collapse whitespace
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
# Drop generic time words that say little about the market
docs <- tm_map(docs, removeWords, c("year", "years", "day",
                                    "month", "quarter", "friday",
                                    "can", "days", "todays"))
# Build the term-document matrix and total each term across all posts
dtm <- TermDocumentMatrix(docs)
term_matrix <- as.matrix(dtm) # renamed so it does not shadow base::matrix
words <- sort(rowSums(term_matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
set.seed(1234) # for reproducibility
wordcloud2(data = df, size = 1, color = 'random-light')
```
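As a rough illustration of what the term-document step computes, the sketch below counts term frequencies in base R on a few made-up posts. The `toy_posts` vector and the `count_terms()` helper are hypothetical names used only for this example; the dashboard itself relies on `tm::TermDocumentMatrix`, and stop-word removal is left out here for brevity.

```r
# Hypothetical toy posts; the real input is reddit_stock$content
toy_posts <- c("Inflation is cooling, buy growth?",
               "Earnings beat: buy or wait",
               "Inflation fears hit growth stocks")

# Illustrative helper (not part of tm): lower-case the text, replace digits
# and punctuation with spaces, split on whitespace, then count each term
count_terms <- function(texts) {
  cleaned <- gsub("[[:punct:][:digit:]]+", " ", tolower(texts))
  tokens  <- unlist(strsplit(cleaned, "\\s+"))
  tokens  <- tokens[nzchar(tokens)]
  counts  <- sort(table(tokens), decreasing = TRUE)
  data.frame(word = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_freq <- count_terms(toy_posts)
# Look up the frequency of a few terms of interest
toy_freq[toy_freq$word %in% c("inflation", "buy", "growth", "earnings"), ]
```

The resulting `word`/`freq` layout matches the data frame passed to `wordcloud2()`, so the same lookup should work on `df` as well.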
## Analysis {.tabset .tabset-fade}
### Number of comments {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 2 {data-height=150 data-width=500}

I am interested in the distribution of the number of comments under each post\
-- Based on the histogram, most posts have fewer than 100 comments; to me, a post with more than 250 comments is relatively popular

#### Visualization {data-height=500 data-width=500}

```{r message=FALSE, fig.height=8, fig.width=15}
ggplot(reddit_stock, aes(num_comments)) +
  geom_histogram(bins=50) +
  labs(x="The number of comments under each post", y="Frequency",
       title="The number of comments histogram") +
  theme(plot.title = element_text(size=12),
        axis.title.x = element_text(size=8, face="bold"),
        axis.title.y = element_text(size=8, face="bold")) +
  cleanup
```
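The two thresholds read off the histogram (100 and 250 comments) can also be checked numerically. The sketch below does this on made-up counts; `toy_num_comments` is a hypothetical stand-in for `reddit_stock$num_comments`, and the resulting shares say nothing about the real data.

```r
# Hypothetical comment counts; the real values are reddit_stock$num_comments
toy_num_comments <- c(3, 12, 40, 75, 99, 100, 180, 260, 410, 8)

# Share of posts with strictly fewer than 100 comments
share_below_100 <- mean(toy_num_comments < 100, na.rm = TRUE)

# Share of posts with strictly more than 250 comments ("relatively popular")
share_above_250 <- mean(toy_num_comments > 250, na.rm = TRUE)

c(below_100 = share_below_100, above_250 = share_above_250)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which avoids counting and dividing by hand.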
### Comments vs SPY price {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 3 {data-height=150 data-width=500}

I am interested in the relationship between the number of comments and the daily SPY price movement\
-- Based on the time series plot, I can see a weak positive relationship

#### Visualization {data-height=500 data-width=500}

```{r message=FALSE, fig.height=8, fig.width=15}
reddit_stock$date <- as.Date(reddit_stock$date, format="%Y-%m-%d")
SPY$date <- as.Date(SPY$date, format="%Y-%m-%d")
# Daily average of comments per post (named tot_comment, but computed with mean())
reddit_daily <- reddit_stock %>%
  group_by(date) %>%
  summarise(tot_comment = mean(num_comments, na.rm=TRUE))
# Left join: keep every Reddit day, even when SPY has no row for that date
reddit_daily <- merge(x=reddit_daily, y=SPY, by="date", all.x=TRUE)
# SPY close is divided by 5 to share the comment axis; sec_axis() reverses it
p <- ggplot(reddit_daily, aes(x=date)) +
  geom_line(aes(y=tot_comment), size=0.7, color='#f97fa3') +
  geom_point(aes(y=tot_comment), color='#f97fa3') +
  geom_line(aes(y=close/5), size=0.7, color='#b3a1e0') +
  geom_point(aes(y=close/5), color='#b3a1e0') +
  scale_y_continuous(name = "Total number of comments",
                     sec.axis = sec_axis(~.*5, name="SPY close price ($)")) +
  theme(plot.title = element_text(size=12),
        axis.title.y = element_text(color = '#f97fa3',
                                    face="bold",
                                    size=8),
        axis.title.y.right = element_text(color = '#b3a1e0',
                                          face="bold",
                                          size=8)) +
  labs(x="Date",
       title="Comments vs SPY close price time series") +
  cleanup
p
```
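The chart above divides the SPY close by a fixed factor of 5 so that both series fit on one axis, and the secondary axis multiplies by 5 to show the original prices. As a hedged sketch of the same idea, the factor could instead be derived from the data. The vectors `toy_comments` and `toy_close` and the name `scale_factor` are assumptions made for this example only.

```r
# Hypothetical daily values; the real columns are tot_comment and close
toy_comments <- c(60, 85, 120, 95)
toy_close    <- c(380, 395, 410, 400)

# Ratio of the two maxima; the chart above hard-codes this factor as 5
scale_factor <- max(toy_close, na.rm = TRUE) / max(toy_comments, na.rm = TRUE)

# Values as they would be drawn against the primary (comments) axis
close_scaled <- toy_close / scale_factor

# The secondary axis applies the inverse, as sec_axis(~ . * scale_factor) would
close_restored <- close_scaled * scale_factor
```

Whatever factor is chosen, the division in `aes()` and the multiplication in `sec_axis()` have to use the same value, otherwise the right-hand axis labels no longer match the plotted prices.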
### Comments vs SPY return {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 4 {data-height=150 data-width=500}

I am interested in how the daily number of comments correlates with the daily SPY return\
-- Based on the scatter plot, I see a positive correlation once the daily number of comments exceeds 100

#### Visualization {data-height=500 data-width=500}

```{r message=FALSE, fig.height=8, fig.width=15}
p <- ggplot(reddit_daily, aes(x=tot_comment, y=return)) +
  geom_point(alpha=0.5) +
  labs(x="Total daily comments", y="SPY daily return",
       title="Correlation between total daily comments and SPY daily return") +
  theme(plot.title = element_text(size=12),
        axis.title.x = element_text(size=8, face="bold"),
        axis.title.y = element_text(size=8, face="bold")) +
  geom_smooth() +
  cleanup
p
```
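The scatter plot shows the relationship visually; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation with base R's `cor()`, once over all days and once only for days above 100 comments. `toy_daily` is made-up data with the same column names as `reddit_daily`, so the values it produces are illustrative and not results from the dashboard.

```r
# Hypothetical daily data; the real columns are tot_comment and return
toy_daily <- data.frame(
  tot_comment = c(40, 70, 90, 110, 130, 150, 170),
  return      = c(0.004, -0.006, 0.001, -0.002, 0.003, 0.006, 0.011)
)

# Pearson correlation over all days, ignoring rows with missing values
cor_all <- cor(toy_daily$tot_comment, toy_daily$return,
               use = "complete.obs")

# Pearson correlation restricted to days with more than 100 comments
busy_days <- toy_daily[toy_daily$tot_comment > 100, ]
cor_busy  <- cor(busy_days$tot_comment, busy_days$return,
                 use = "complete.obs")

c(all_days = cor_all, above_100 = cor_busy)
```

With only about six weeks of trading days in the sample, any such coefficient rests on few observations and should be read with caution.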
### Comments for stocks {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 5 {data-height=150 data-width=500}

I am interested in how the number of comments compares across three of my favorite stocks: GOOGL, AMZN, and TSLA\
-- It looks like GOOGL receives relatively more comments per post than AMZN and TSLA

#### Visualization {data-height=500 data-width=500}

```{r message=FALSE, fig.height=8, fig.width=15}
reddit_comment_ticker$ticker <- factor(reddit_comment_ticker$ticker)
p <- ggplot(reddit_comment_ticker, aes(x=ticker, y=num_comments,
                                       colour=num_comments)) +
  geom_jitter(width = 0.2, alpha = 0.25) +
  geom_boxplot(alpha = 0.25, coef = 2.5) +
  labs(x="Ticker", y="The number of comments",
       title="The number of comments Box Plot for GOOGL, AMZN, and TSLA") +
  theme(plot.title = element_text(size=12),
        axis.title.x = element_text(size=8, face="bold"),
        axis.title.y = element_text(size=8, face="bold")) +
  cleanup
p
```
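To back up the box plot comparison with numbers, a per-ticker summary can be computed with base R's `tapply()`. The sketch below uses made-up rows; `toy_ticker` is a hypothetical stand-in for `reddit_comment_ticker`, and its medians are not the dashboard's actual values.

```r
# Hypothetical rows; the real data is reddit_comment_ticker
toy_ticker <- data.frame(
  ticker       = c("GOOGL", "GOOGL", "AMZN", "AMZN", "TSLA", "TSLA"),
  num_comments = c(120, 80, 60, 40, 30, 90)
)

# Median number of comments and number of posts per ticker
ticker_median <- tapply(toy_ticker$num_comments, toy_ticker$ticker, median)
ticker_n      <- tapply(toy_ticker$num_comments, toy_ticker$ticker, length)

# Tickers ordered from most to least commented (by median)
sort(ticker_median, decreasing = TRUE)
```

The median is used here because comment counts are heavily right-skewed, so a few very popular posts would dominate a mean.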