Introduction

This project examines the relationship between daily stock returns and mean Twitter sentiment scores. There are many research articles on the web about the relationship between social media sentiment and stock prices. In this project, we try to identify whether there is any relationship between daily returns for three tech stocks and Twitter sentiment on a daily basis. The stocks we have chosen are Amazon, Apple, and Microsoft. Due to the limitations of the Twitter API, we could only get tweets for the last 10 days, so we have used the closing prices of those days for this project.

The motivation behind this project is to find out whether social media sentiment really does affect stock prices in the short term. If it does, this could give day traders a way to predict prices over the short term. The research questions we will be answering are:

  1. Do daily returns of Amazon stocks have a statistically significant relationship with mean twitter sentiment scores?
  2. Do daily returns of Amazon stocks have a statistically significant relationship with the count of twitter sentiment?
  3. Do the same relationships hold for Apple and Microsoft?

Workflow

This project follows the OSEMN workflow. It is detailed as follows:

  1. Obtain data

Tweets about AMZN, AAPL and MSFT were obtained through the free Twitter API; we had to register for a developer account and create an app for this. 30,106 tweets for AMZN and 31,620 tweets for AAPL were collected, covering 27th Nov 2019 to 6th Dec 2019. The Twitter data were saved as a CSV file for each stock.

The closing price data were collected using the R package quantmod. This package returns Open, High, Low, Close, Volume and Adjusted closing prices. We have used the adjusted closing price in our return calculation, as it accounts for corporate actions such as dividends and splits.

It should be mentioned that there were only 6 trading days during the period in question, and we have limited our analysis to those days.

  2. Scrub/Clean data

The tweets were cleaned using stringr. The dates had to be converted so R could read them.

  3. Explore Data

We did exploratory data analysis on the stock prices to check for volatility. The tweet data were inspected so we could clean them further for the sentiment analysis step.

The prices were then plotted against sentiment scores and tweet frequency to check for relationships.

  4. Model Data

Linear regression models are used to explore whether the sentiment scores can be used to predict stock returns.

  5. Interpret

The findings are then outlined in the conclusion section.

To learn more about OSEMN, visit https://machinelearningmastery.com/how-to-work-through-a-problem-like-a-data-scientist/

Challenges faced

A few challenges were faced during the project:

  1. Limited Twitter API data: I had access to only 10 days of tweets using the free API. I had to retry on rate limit to get enough observations, and that took some time.

  2. Memory overflow: Initially I wanted to keep all the code in the same Rmd file, but later decided to save the AAPL and MSFT findings as CSVs in a different file so I could knit (and submit on time!).

  3. I used cbind several times and had to keep renaming the variables (a base-R alternative is sketched in the Explore data section).

  4. Dates had to be re-converted with as.Date every time I loaded a CSV; see the sketch below for one way around this.
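One way around the repeated as.Date calls (a minimal sketch, not part of the original workflow) is to declare the column class when reading the CSV; read.csv accepts a named colClasses vector:

#Hypothetical example: the Date column is parsed as a Date on the way in,
#so no separate as.Date call is needed afterwards
amzn <- read.csv("amzn.csv", stringsAsFactors = FALSE,
                 colClasses = c(Date = "Date"))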

Load libraries

The following R libraries are used throughout the project:

library(tidyverse)
library(ggplot2)
library(stringr)
library(rtweet)
library(quantmod)
library(SentimentAnalysis)
library(gridExtra)
library(data.table)

Obtain Data

First, a connection is created with the Twitter API using rtweet. The keys and tokens below are masked.

##Creating the connection with rtweet

create_token(
  app = "607",
  consumer_key = "***********",
  consumer_secret = "***********************",
  access_token = "***********************",
  access_secret = "*************************"
)

Then we obtain the Amazon Twitter data. We created a loop that iterates through each day to collect tweets, so that our dataset is not dominated by recent tweets. We also use retryonratelimit, as the free API limits the number of requests (https://developer.twitter.com/en/docs/basics/rate-limits). We have excluded all retweets from this dataset.

We get about 90 different variables in the Twitter dataset. We will only be using the Date and text columns for our analysis.

We want to store this dataset as a CSV so we can read it in for analysis later. For this purpose, we encode the text column to UTF-8 and convert the date using as.Date.

#dates vector stepping one day at a time; querying day by day keeps the
#dataset from being dominated by the most recent tweets
dates <- seq.Date(from = as.Date("2019-12-01"), to = as.Date("2019-12-06"), by = 1)
#Empty dataframe created
df_amzn <- data.frame()
#Loop over the days, binding each day's pull to the amazon dataframe.
#The standard search API's `until` query parameter is exclusive, so we pass
#the following day to restrict each request to tweets created up to dates[i]
for (i in seq_along(dates)) {
  df_temp <- search_tweets("@$amzn OR @AMZNNews", n = 15000,
                           lang = 'en', include_rts = FALSE,
                           retryonratelimit = TRUE,
                           until = as.character(dates[i] + 1))
  df_amzn <- rbind(df_amzn, df_temp)
}

amzn<-df_amzn
#Encode to UTF-8
amzn$text <- enc2utf8(amzn$text)
#Extract the date part of the timestamp and convert with as.Date. Since we
#compare daily data, we do not need the hours and minutes.
date <- amzn$created_at
date <- str_extract(date, "\\d{4}-\\d{2}-\\d{2}")
date <- as.Date(date)
amzn$Date <- date
#Keep only the columns we need
amzn<-subset(amzn, select=c(Date,text))
#Write as csv; this file has been uploaded to GitHub and is read back in below
write.csv(amzn, file ="amzn.csv", row.names = FALSE, fileEncoding="UTF-8")

To get the stock price data, we use quantmod with Yahoo Finance as the data source.

#setting dates on which we need the prices
start <- as.Date("2019-11-27")
end <- as.Date("2019-12-06")
getSymbols("AMZN", src = "yahoo", from = start, to = end)
## [1] "AMZN"
price_amzn<-as.data.frame(AMZN)
#Extracting date
price_amzn$Date<-as.Date(index(AMZN))
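As a side note (this check is not part of the original analysis), quantmod also provides the Ad() accessor, which extracts just the adjusted-close column used in the return calculation:

#Ad() pulls the AMZN.Adjusted column from the xts object returned by getSymbols
head(Ad(AMZN))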

Clean data

There is a lot of cleanup required in the text column to make it ready for the sentiment analysis package. We cleaned up whitespace, removed punctuation, removed $AMZN (the cashtag we used to acquire tweets about the stock), removed emojis, and removed URLs. We take a glimpse at the corpus and find it to be much cleaner.

Cleaning up Amazon Data

#read the file into R
amzn<-read.csv('https://raw.githubusercontent.com/zahirf/Data607/master/amzn.csv', stringsAsFactors = FALSE)
class(amzn$Date)
## [1] "character"
#date is read in as character, so we convert
amzn$Date<-as.Date(amzn$Date)
#clean up the text
amzn$text <- gsub("http.*", "", amzn$text)#remove url
amzn$text <- gsub("https.*", "", amzn$text)#remove url
amzn$text <- gsub("&amp;", "&", amzn$text) #remove &
amzn$text <- gsub("$AMZN ", "", amzn$text)#remove handle
amzn$text <- gsub("^[[:space:]]*","",amzn$text) # Remove leading whitespaces
amzn$text <- gsub(" +"," ",amzn$text) #Remove extra whitespaces
amzn$text <- iconv(amzn$text, "latin1", "ASCII", sub="") # Remove emojis
amzn$text <- gsub("\\n", "", amzn$text) #Replace line breaks with ""
amzn$text <- gsub("[[:punct:]]","",amzn$text) # Remove punctuation
amzn$text <- gsub("^[0-9]*$","",amzn$text) # Remove punctuation
glimpse(amzn$text)
##  chr [1:30106] "Amazon  AMZNAmazon bullish for mondayLong or short it on WCX " ...

Explore data

Stock Price

We would like to do some exploratory analysis on the stock price data and get an idea of the volatility during the research period.

Let us look at the summary of the AMZN stock price. We see that the mean adjusted closing price was 1779, with the price ranging from a high of 1819 to a low of 1740, a decline of about 4% ((1819 - 1740) / 1819 ≈ 4.3%).

summary(price_amzn)
##    AMZN.Open      AMZN.High       AMZN.Low      AMZN.Close  
##  Min.   :1760   Min.   :1764   Min.   :1740   Min.   :1740  
##  1st Qu.:1766   1st Qu.:1777   1st Qu.:1750   1st Qu.:1763  
##  Median :1788   Median :1797   Median :1761   Median :1776  
##  Mean   :1787   Mean   :1797   Mean   :1768   Mean   :1779  
##  3rd Qu.:1804   3rd Qu.:1820   3rd Qu.:1789   3rd Qu.:1796  
##  Max.   :1818   Max.   :1825   Max.   :1801   Max.   :1819  
##   AMZN.Volume      AMZN.Adjusted       Date           
##  Min.   :1923400   Min.   :1740   Min.   :2019-11-27  
##  1st Qu.:2708525   1st Qu.:1763   1st Qu.:2019-11-29  
##  Median :2924700   Median :1776   Median :2019-12-02  
##  Mean   :2958233   Mean   :1779   Mean   :2019-12-01  
##  3rd Qu.:3292075   3rd Qu.:1796   3rd Qu.:2019-12-03  
##  Max.   :3925600   Max.   :1819   Max.   :2019-12-05

We now look at the price and volume trends using a candlestick chart. This is a very handy tool, as it shows each day's volatility. For AMZN, we see that the highest volatility in prices came near the end of the window. A red bar means that the closing price was lower than the opening price on that particular day.

candleChart(AMZN,theme = chartTheme("white",up.col='blue',dn.col='red'))

Sentiment Analysis

We use the SentimentAnalysis package to calculate sentiment scores for the Amazon tweets. The library uses four dictionaries: GI, HE, LM and QDAP. Each dictionary returns a positivity score, a negativity score, and a net score, which is the difference between the two. After calculating the column sums, we find that GI and QDAP have picked up far more sentiment words than the other two dictionaries, so we decide to use GI for our analysis.

#run sentiment analysis
sentiment_amzn <- analyzeSentiment(amzn$text,
                            language = "english",
                            removeStopwords = TRUE, stemming = TRUE)
#calculate colSum to decide on dictionary
colSums(sentiment_amzn, na.rm=TRUE)
##          WordCount        SentimentGI       NegativityGI 
##        400645.0000          1508.1106          2189.3049 
##       PositivityGI        SentimentHE       NegativityHE 
##          3697.4155           271.4827           200.5177 
##       PositivityHE        SentimentLM       NegativityLM 
##           472.0004          -338.3169           905.8841 
##       PositivityLM RatioUncertaintyLM      SentimentQDAP 
##           567.5672           225.5134          1168.1036 
##     NegativityQDAP     PositivityQDAP 
##          1383.9282          2552.0318

Let us build a new dataframe containing all the columns we will need.

#Build the final dataframe for analysis. Note that cbind on mixed vectors
#returns a matrix, which coerces the Date column to numeric and drops names.
df_amzn<-cbind(amzn$Date, sentiment_amzn$WordCount, sentiment_amzn$SentimentGI)
df_amzn<-as.data.frame(df_amzn)
#remove NA values
df_amzn <- df_amzn[complete.cases(df_amzn), ]
#rename columns after cbind
colnames(df_amzn)[1:3]<-c("Date", "Count", "MeanSentiment")

#Frequency of sentiment words by date
count_amzn<-df_amzn%>%
  group_by(Date)%>%
  summarise(Count = sum(Count))

#Mean sentiments by date
mean_amzn<-df_amzn%>%
  group_by(Date)%>%
  summarise(Mean = mean(MeanSentiment))
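As an aside, the cbind-then-rename step (and the lost Date class noted under Challenges) can be avoided with a plain data.frame() call. A minimal sketch, not what the original analysis ran:

#data.frame() keeps the Date class and sets the column names in one step
df_amzn <- data.frame(Date = amzn$Date,
                      Count = sentiment_amzn$WordCount,
                      MeanSentiment = sentiment_amzn$SentimentGI)
df_amzn <- df_amzn[complete.cases(df_amzn), ]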

Visualizing the variables together

To plot both variables on the same chart, we need the daily returns of the stock prices, where the simple return on day t is (P_t - P_(t-1)) / P_(t-1). We have already calculated these in Excel and will import the CSV.
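The same returns could also be computed directly in R instead of Excel; a minimal sketch using quantmod's dailyReturn() on the adjusted close (not what the original analysis ran):

#dailyReturn() computes simple period-over-period returns from the
#adjusted-close series extracted by Ad()
ret_r <- dailyReturn(Ad(AMZN), type = "arithmetic")
head(ret_r)

The analysis below continues with the pre-computed returns.csv.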

ret<-read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/returns.csv", stringsAsFactors = FALSE)
mean_amzn<-cbind(mean_amzn,ret$AMZN, count_amzn$Count)
colnames(mean_amzn)[3]<-("Returns")
colnames(mean_amzn)[4]<-("Count")

After plotting the variables, we do not see any clear trends between the stock returns and the other two variables.

plot1 <- ggplot(data=mean_amzn, aes(x=as.Date(Date),y=Returns, group=1)) +
  geom_line()+
  geom_point() +
  ylab("Closing stock price")+
  xlab("Date")
 

plot2 <- ggplot(data=mean_amzn, aes(x=as.Date(Date),y=Mean, group=1)) +
  geom_line()+
  geom_point() +
  ylab("Mean Score")+
  xlab("Date")

plot3<- ggplot(data=mean_amzn, aes(x=as.Date(Date),y=Count, group=1)) +
  geom_line()+
  geom_point() +
  ylab("No of sentiments")+
  xlab("Date")
  
  
grid.arrange(plot1, plot2,plot3, nrow=3)  

Model Data

We will run a linear regression model to estimate whether the sentiment scores can be used to predict the returns of AMZN.

We see that the adjusted R-squared is very low and the p-values are high.

reg<-lm(Returns~Mean, mean_amzn)
summary(reg)
## 
## Call:
## lm(formula = Returns ~ Mean, data = mean_amzn)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.008492 -0.007187 -0.003815  0.007185  0.015218 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01117    0.02243  -0.498    0.632
## Mean         0.14747    0.41450   0.356    0.731
## 
## Residual standard error: 0.01006 on 8 degrees of freedom
## Multiple R-squared:  0.01558,    Adjusted R-squared:  -0.1075 
## F-statistic: 0.1266 on 1 and 8 DF,  p-value: 0.7312
cor(mean_amzn$Returns,mean_amzn$Mean)
## [1] 0.1248046
ggplot(mean_amzn, aes(x=Mean, y=Returns))+
  geom_point()+
  geom_smooth(method='lm')

Let us run the same regression on the frequency of sentiment words. Again we have a high p-value and a low R-squared.

reg1<-lm(Returns~Count, mean_amzn)
summary(reg1)
## 
## Call:
## lm(formula = Returns ~ Count, data = mean_amzn)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.010675 -0.006610 -0.001305  0.008764  0.011919 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.467e-03  6.803e-03   0.510    0.624
## Count       -1.682e-07  1.526e-07  -1.103    0.302
## 
## Residual standard error: 0.009444 on 8 degrees of freedom
## Multiple R-squared:  0.1319, Adjusted R-squared:  0.02341 
## F-statistic: 1.216 on 1 and 8 DF,  p-value: 0.3023
cor(mean_amzn$Returns,mean_amzn$Count)
## [1] -0.3632094
ggplot(mean_amzn, aes(x=Count, y=Returns))+
  geom_point()+
  geom_smooth(method='lm')

Testing on Apple

Let us now visualize the relationship between these variables for a second stock. The data have been collected and saved to a CSV file to conserve memory on the local machine. The code can be found here: http://rpubs.com/zahirf/557769

Again there is no clear trend.

apple<-read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/data_aapl.csv", stringsAsFactors = FALSE)
glimpse(apple)
## Observations: 10
## Variables: 4
## $ Date    <int> 18227, 18228, 18229, 18230, 18231, 18232, 18233, 18234...
## $ Mean    <dbl> 0.07769088, 0.07072368, 0.06571325, 0.06974535, 0.0698...
## $ Returns <dbl> 0.010758164, 0.010758164, -0.010706638, -0.010706638, ...
## $ Count   <int> 4353, 27429, 37362, 15906, 22338, 67002, 65028, 59630,...
#Date was saved as days since the Unix epoch, so supply the origin explicitly
apple$Date<-as.Date(apple$Date, origin = "1970-01-01")
plot4 <- ggplot(data=apple, aes(x=as.Date(Date),y=Returns, group=1)) +
  geom_line()+
  geom_point() +
  ylab("Stock Returns")+
  xlab("Date")
 
plot5 <- ggplot(data=apple, aes(x=as.Date(Date),y=Mean, group=1)) +
  geom_line()+
  geom_point() +
  ylab("Mean Scores")+
  xlab("Date")

plot6<- ggplot(data=apple, aes(x=as.Date(Date),y=Count, group=1)) +
  geom_line()+
  geom_point() +
  ylab("No of sentiment words")+
  xlab("Date")

grid.arrange(plot4, plot5,plot6, nrow=3) 

Let us run a linear model on this. We see a higher correlation of around 20%, but not enough to support a model that can satisfactorily predict stock returns.

reg2<-lm(Returns~Mean, apple)
summary(reg2)
## 
## Call:
## lm(formula = Returns ~ Mean, data = apple)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.020465 -0.009864  0.002970  0.008305  0.019198 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01430    0.01995  -0.717    0.494
## Mean         0.19577    0.32670   0.599    0.566
## 
## Residual standard error: 0.0129 on 8 degrees of freedom
## Multiple R-squared:  0.04296,    Adjusted R-squared:  -0.07667 
## F-statistic: 0.3591 on 1 and 8 DF,  p-value: 0.5656
cor(apple$Returns,apple$Mean)
## [1] 0.207262
ggplot(apple, aes(x=Mean, y=Returns))+
  geom_point()+
  geom_smooth(method='lm')

Let us test the model with the count of sentiment words. Again the p-value is very high.

reg3<-lm(Returns~Count, apple)
summary(reg3)
## 
## Call:
## lm(formula = Returns ~ Count, data = apple)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.022007 -0.008569  0.000511  0.009926  0.018775 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.453e-03  8.668e-03  -0.168    0.871
## Count       -2.626e-08  1.749e-07  -0.150    0.884
## 
## Residual standard error: 0.01316 on 8 degrees of freedom
## Multiple R-squared:  0.002811,   Adjusted R-squared:  -0.1218 
## F-statistic: 0.02255 on 1 and 8 DF,  p-value: 0.8843
cor(apple$Returns,apple$Count)
## [1] -0.05301883
ggplot(apple, aes(x=Count, y=Returns))+
  geom_point()+
  geom_smooth(method='lm')

Testing on Microsoft

The code used to create the CSV file is here: http://rpubs.com/zahirf/557775

We skip the visualizations here and only run the regressions. The results are shown in the conclusion.

microsoft<-read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/data_msft.csv", stringsAsFactors = FALSE)
reg4<-lm(Returns~Mean, microsoft)
reg5<-lm(Returns~Count, microsoft)

Conclusion and Limitations

Let us summarize the findings so far:

Using stock returns and mean sentiment scores, we see that the R-squared values are very low and the p-values are very high for all three stocks. The variables do not have a statistically significant relationship.

meanSentiment<-data.frame("Ticker"=c("AMZN","AAPL", "MSFT"),
                          "RSquare"=c(summary(reg)$r.squared,summary(reg2)$r.squared,
                                      summary(reg4)$r.squared),
                          "PValue"=c(anova(reg)$'Pr(>F)'[1],anova(reg2)$'Pr(>F)'[1],
                                     anova(reg4)$'Pr(>F)'[1]))
meanSentiment
##   Ticker     RSquare    PValue
## 1   AMZN 0.015576185 0.7312028
## 2   AAPL 0.042957536 0.5655938
## 3   MSFT 0.009851615 0.7850054

Using the count of sentiment words, we see the statistics improving a little for MSFT, but the relationship is still not statistically significant at the usual 5% level.

countSentiment<-data.frame("Ticker"=c("AMZN","AAPL", "MSFT"),
                           "RSquare"=c(summary(reg1)$r.squared,summary(reg3)$r.squared,
                                       summary(reg5)$r.squared),
                           "PValue"=c(anova(reg1)$'Pr(>F)'[1],anova(reg3)$'Pr(>F)'[1],
                                      anova(reg5)$'Pr(>F)'[1]))
countSentiment
##   Ticker     RSquare     PValue
## 1   AMZN 0.131921092 0.30225767
## 2   AAPL 0.002810996 0.88434678
## 3   MSFT 0.346724401 0.07329421

We therefore find no evidence that either Twitter sentiment scores or counts of sentiment words can be used to predict stock returns in the short run.

However, the dataset was limited to only 10 days, and it would be interesting to repeat the analysis with a sample of at least 30 returns and coverage of more stocks in the same sector.