---
title: "ANLY512 - Final Project"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(quantmod)
library(plyr)
library(ggplot2)
library(dplyr)
library(ggalt)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(wordcloud2)

reddit_stock <- read.csv("submissions.csv")
SPY <- read.csv("SPY.csv")
reddit_comment_ticker <- read.csv("comment_tickers.csv")
cleanup <- theme(panel.grid.major = element_blank(), #no grid lines
                panel.grid.minor = element_blank(), #no grid lines
                panel.background = element_blank(), #no background
                axis.line.x = element_line(color = 'black'), #black x axis line
                axis.line.y = element_line(color = 'black'), #black y axis line
                legend.key = element_rect(fill = 'white'), #no legend background
                text = element_text(size = 15)) #bigger text size
```

## Interactive Word Cloud {.storyboard data-orientation=rows data-icon="fa-list" data-width=500}

### Summary {data-height=300 data-width=500}

I follow specific social media communities for financial market information in my daily life\
This dashboard visualizes an analysis of daily posts in the subreddit "r/stocks" (https://www.reddit.com/r/stocks/) from 2022-10-25 to 2022-12-02\
\
* Data collection: \
    -- Reddit posts and comments are collected from the Reddit API through PRAW: https://praw.readthedocs.io/en/stable/ \
    -- SPY price data are collected from Yahoo Finance\
* Data storage:\
    -- Data are stored on my local computer in .csv format\
* Data manipulation:\
    -- The content of each post is cleaned and tokenized using NLP techniques\
* Data visualization:\
    -- Mainly relies on the ggplot2 and wordcloud2 packages\
\

Question 1:\
I am interested in which terms were mentioned in the "r/stocks" subreddit over the collection period\
-- By hovering over a term in the word cloud, I can easily see its frequency\
-- Key terms such as "Inflation", "Buy", "Growth", and "Earnings" give me a general picture of how people evaluate the current market

### Interactive Word Cloud {data-height=600 data-width=500 fig.height=8 fig.width=15}

```{r message=FALSE}
RedditContent <- reddit_stock$content

# Transform docs
docs <- Corpus(VectorSource(RedditContent))

# Cleaning text
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("year", "years", "day",
                                    "month", "quarter", "friday",
                                    "can", "days", "todays"))

# Build the term-document matrix and count how often each term occurs
tdm <- TermDocumentMatrix(docs)
term_matrix <- as.matrix(tdm) # named term_matrix so it does not shadow base::matrix
words <- sort(rowSums(term_matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)

set.seed(1234) # for reproducibility
wordcloud2(data = df, size = 1, color = 'random-light')
```
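
The frequencies behind the word cloud can also be listed directly instead of being read off term by term. The block below is only a minimal sketch: `top_terms()` is a hypothetical helper, and `demo_freq` is made-up toy data that merely mimics the `word`/`freq` layout of the `df` built for the word cloud.

```r
# Hypothetical helper: return the n most frequent terms from a data frame
# that has the columns `word` and `freq`.
top_terms <- function(freq_df, n = 10) {
  ordered <- freq_df[order(freq_df$freq, decreasing = TRUE), , drop = FALSE]
  rownames(ordered) <- NULL
  head(ordered, n)
}

# Toy data for illustration only; not taken from submissions.csv.
demo_freq <- data.frame(
  word = c("buy", "inflation", "growth", "earnings"),
  freq = c(12, 30, 7, 18)
)

top_two <- top_terms(demo_freq, n = 2)
top_two
```

Under the same assumption about the column names, `top_terms(df, 20)` would list the 20 most frequent terms of the real corpus.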

## Analysis {.tabset .tabset-fade}

### Number of comments {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 2 {data-height=150 data-width=500}

I am interested in the distribution of the number of comments under each post\
-- Based on the histogram, most posts have fewer than 100 comments, and I consider posts with more than 250 comments to be relatively popular\
\

#### Visualization {data-height=500 data-width=500 fig.height=8 fig.width=15}

```{r message=FALSE}
ggplot(reddit_stock, aes(x = num_comments)) +
  geom_histogram(bins = 50) +
  labs(x = "Number of comments under each post", y = "Frequency",
       title = "Histogram of the number of comments per post") +
  theme(plot.title = element_text(size = 12),
        axis.title.x = element_text(size = 8, face = "bold"),
        axis.title.y = element_text(size = 8, face = "bold")) +
  cleanup
```
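
The "fewer than 100" and "more than 250" thresholds can be checked numerically rather than by eye. This is a rough sketch under stated assumptions: `share_below()` is a hypothetical helper and `demo_comments` is a made-up vector, not the real `num_comments` column.

```r
# Hypothetical helper: share of posts whose comment count is below a threshold.
share_below <- function(num_comments, threshold) {
  mean(num_comments < threshold, na.rm = TRUE)
}

# Toy comment counts for illustration only; not taken from submissions.csv.
demo_comments <- c(3, 12, 45, 80, 99, 120, 260, 310, NA, 5)

# Share of posts with fewer than 100 comments
share_under_100 <- share_below(demo_comments, 100)

# Share of posts with more than 250 comments
share_over_250 <- mean(demo_comments > 250, na.rm = TRUE)

c(under_100 = share_under_100, over_250 = share_over_250)
```

Assuming the column name used in the histogram, `share_below(reddit_stock$num_comments, 100)` would give the corresponding share for the real data.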

### Comments vs SPY price {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 3 {data-height=150 data-width=500}

I am interested in the relationship between the number of comments and the daily SPY price movement\
-- Based on the time series plots, I can see a weak positive relationship\

#### Visualization {data-height=500 data-width=500 fig.height=8 fig.width=15}

```{r message=FALSE}
reddit_stock$date <- as.Date(reddit_stock$date, format="%Y-%m-%d")
SPY$date <- as.Date(SPY$date, format="%Y-%m-%d")

# Note: despite its name, tot_comment holds the daily mean number of comments per post
reddit_daily <- reddit_stock %>%
  group_by(date) %>%
  summarise(tot_comment = mean(num_comments, na.rm = TRUE))

reddit_daily <- merge(x=reddit_daily, y=SPY, by = "date", all.x=TRUE)

p <- ggplot(reddit_daily, aes(x = date)) +
  geom_line(aes(y = tot_comment), size = 0.7, color = '#f97fa3') +
  geom_point(aes(y = tot_comment), color = '#f97fa3') +
  # SPY close is divided by 5 to share the primary axis;
  # the secondary axis multiplies by 5 to undo the division
  geom_line(aes(y = close / 5), size = 0.7, color = '#b3a1e0') +
  geom_point(aes(y = close / 5), color = '#b3a1e0') +
  scale_y_continuous(name = "Mean number of comments per post",
                     sec.axis = sec_axis(~ . * 5, name = "SPY close price ($)")) +
  theme(plot.title = element_text(size = 12),
        axis.title.y = element_text(color = '#f97fa3',
                                    face = "bold",
                                    size = 8),
        axis.title.y.right = element_text(color = '#b3a1e0',
                                          face = "bold",
                                          size = 8)) +
  labs(x = "Date",
       title = "Comments vs SPY close price time series") +
  cleanup
p
```
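
The plot above places SPY on the comment axis by dividing the close price by a fixed 5 and multiplying the secondary axis by 5. As a hedged alternative, the factor can be derived from the data so the two series always share a comparable range. The sketch below uses a made-up `demo_daily` data frame and a hypothetical `scale_factor` variable; it is not the original analysis.

```r
library(ggplot2)

# Toy daily data for illustration only; not taken from the CSV files.
demo_daily <- data.frame(
  date = as.Date("2022-11-01") + 0:4,
  comments = c(60, 85, 72, 90, 78),
  close = c(385, 375, 371, 377, 395)
)

# Factor chosen so the largest close maps onto the largest comment value.
scale_factor <- max(demo_daily$close, na.rm = TRUE) /
  max(demo_daily$comments, na.rm = TRUE)

demo_plot <- ggplot(demo_daily, aes(x = date)) +
  geom_line(aes(y = comments), color = "#f97fa3") +
  # Price is divided by the factor so both series share the primary axis.
  geom_line(aes(y = close / scale_factor), color = "#b3a1e0") +
  scale_y_continuous(
    name = "Comments",
    # The secondary axis multiplies by the same factor to undo the division.
    sec.axis = sec_axis(~ . * scale_factor, name = "SPY close price ($)")
  )

demo_plot
```

The design point is that the division inside `aes()` and the multiplication inside `sec_axis()` must use the same factor, otherwise the right-hand axis no longer reports true prices.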

### Comments vs SPY return {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 4 {data-height=150 data-width=500}

I am interested in how the daily number of comments correlates with the daily SPY return \
-- Based on the scatter plot, I can see a positive correlation once the daily number of comments exceeds 100\

#### Visualization {data-height=500 data-width=500 fig.height=8 fig.width=15}

```{r message=FALSE}
p <- ggplot(reddit_daily, aes(x = tot_comment, y = return)) +
  geom_point(alpha = 0.5) +
  geom_smooth() + # default smoother (loess for small samples) with its confidence band
  labs(x = "Mean daily comments per post", y = "SPY daily return",
       title = "Correlation between daily comments and SPY daily return") +
  theme(plot.title = element_text(size = 12),
        axis.title.x = element_text(size = 8, face = "bold"),
        axis.title.y = element_text(size = 8, face = "bold")) +
  cleanup
p

```
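
The visual impression of a positive correlation above 100 comments can be backed by a correlation coefficient. The following is a minimal sketch, assuming a data frame with one row per day: `cor_above()` is a hypothetical helper, the column names `comments` and `ret` are placeholders, and `demo_daily` is made-up data. With only a few weeks of observations, any such coefficient should be read as weak evidence.

```r
# Hypothetical helper: Pearson correlation between daily comments and SPY
# return, restricted to days whose comment count exceeds `min_comments`.
cor_above <- function(daily, min_comments = 0) {
  keep <- !is.na(daily$comments) & daily$comments > min_comments
  cor(daily$comments[keep], daily$ret[keep], use = "complete.obs")
}

# Toy daily data for illustration only; not taken from the CSV files.
demo_daily <- data.frame(
  comments = c(40, 60, 90, 110, 130, 150, 170),
  ret = c(0.004, -0.002, 0.001, -0.003, 0.002, 0.006, 0.009)
)

cor_all <- cor_above(demo_daily)                      # all days
cor_busy <- cor_above(demo_daily, min_comments = 100) # days above 100 comments

c(all_days = cor_all, above_100 = cor_busy)
```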

### Comments for stocks {.storyboard data-orientation=rows data-icon="fa-list"}

#### Question 5 {data-height=150 data-width=500}

I am interested in how the number of comments compares across 3 of my favorite stocks: GOOGL, AMZN, and TSLA \
-- It looks like GOOGL receives relatively more comments per post than AMZN and TSLA \
\

#### Visualization {data-height=500 data-width=500 fig.height=8 fig.width=15}

```{r message=FALSE}
reddit_comment_ticker$ticker <- factor(reddit_comment_ticker$ticker)

p <- ggplot(reddit_comment_ticker, aes(x = ticker, y = num_comments)) +
  # colour is mapped only on the points; each boxplot summarises a whole ticker
  geom_jitter(aes(colour = num_comments), width = 0.2, alpha = 0.25) +
  geom_boxplot(alpha = 0.25, coef = 2.5) +
  labs(x = "Ticker", y = "The number of comments",
       title = "Number of comments per post by ticker") +
  theme(plot.title = element_text(size = 12),
        axis.title.x = element_text(size = 8, face = "bold"),
        axis.title.y = element_text(size = 8, face = "bold")) +
  cleanup

p

```
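
The box plot comparison can be summarized as one number per ticker. This is a small sketch using base R's `aggregate()`; `demo_tickers` is made-up data that only mimics the `ticker`/`num_comments` columns used in the plot, and the median is my choice of summary, not something taken from the original analysis.

```r
# Toy data for illustration only; not taken from comment_tickers.csv.
demo_tickers <- data.frame(
  ticker = c("GOOGL", "GOOGL", "AMZN", "AMZN", "TSLA", "TSLA"),
  num_comments = c(120, 80, 40, 60, 30, 90)
)

# Median number of comments per post for each ticker.
ticker_summary <- aggregate(num_comments ~ ticker, data = demo_tickers, FUN = median)

# Sort so the ticker with the highest median comes first.
ticker_summary <- ticker_summary[order(ticker_summary$num_comments, decreasing = TRUE), ]
ticker_summary
```

The median is used here because comment counts are typically skewed by a few very popular posts.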