This project examines the relationship between daily stock returns and mean Twitter sentiment scores. There are many research articles on the web about the relationship between social media sentiment and stock prices. In this project, we try to identify whether there is any relationship between daily returns for three tech stocks and Twitter sentiment on a daily basis. The stocks chosen are Amazon, Apple, and Microsoft. Due to the limitations of the free Twitter API, we could only get tweets for the last 10 days, so we have used the closing prices of those days for this project.
The motivation behind this project is to find out whether social media sentiment really does affect stock prices in the short term. If it does, day traders could use it to predict prices over short horizons. The research questions we will be answering are whether mean sentiment scores, and counts of sentiment words, have a statistically significant relationship with daily stock returns.
This project follows the OSEMN workflow. It is detailed as follows:
Tweets about AMZN, AAPL, and MSFT were obtained through the free Twitter API, which required registering for a developer account and creating an app. We collected 30,106 tweets for AMZN and 31,620 tweets for AAPL, covering 27 November 2019 to 6 December 2019. The Twitter data were saved as a CSV file for each stock.
The closing price data was collected using the R package quantmod, which returns Open, High, Low, Close, Volume, and Adjusted closing prices. We use the adjusted closing price in our return calculation, as it accounts for corporate actions such as dividends and splits.
It should be mentioned that there were only 6 trading days during the period in question, and we have limited our analysis to those days.
The tweets were cleaned using stringr. The dates had to be converted so that R could read them.
We did exploratory data analysis on the stock prices to check for volatility. The tweet data had to be inspected so it could be cleaned further for the sentiment analysis step.
The prices were then plotted alongside the sentiment scores and sentiment-word frequencies to check for relationships.
Linear regression models have been used to explore whether the sentiment scores can be used to predict stock prices.
The findings are then outlined in the conclusion section.
To know more about OSEMN, visit https://machinelearningmastery.com/how-to-work-through-a-problem-like-a-data-scientist/
A few challenges were faced during the project:
Limited Twitter API data: I had access to only about 10 days of tweets using the free API. I had to retry on rate limit to get enough observations, and that took some time.
Memory overflow: Initially I wanted to keep all the code in the same Rmd file, but I later decided to save the AAPL and MSFT findings as CSV files produced in a separate file so I could knit (and submit on time!)
I have used cbind several times and had to keep renaming the variables (see the sketch after this list for an alternative)
Had to re-convert dates with as.Date every time I loaded a CSV (also addressed in the sketch below)
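The following sketch is hypothetical (not code from the project itself) and shows one way around the last two annoyances: building the frame with data.frame() so the columns are named up front and the Date class is preserved, and letting read.csv() coerce the date column at read time via colClasses.
#data.frame() names columns directly and keeps the Date class, unlike cbind(), which coerces dates to numeric
df <- data.frame(Date = as.Date("2019-12-01") + 0:2,
                 Count = c(10L, 20L, 30L),
                 MeanSentiment = c(0.10, -0.20, 0.05))
#colClasses converts the Date column while reading, so no separate as.Date() call is needed afterwards
amzn <- read.csv("amzn.csv", stringsAsFactors = FALSE,
                 colClasses = c(Date = "Date", text = "character"))
class(amzn$Date) #"Date"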
The following R libraries are used throughout the project:
library(tidyverse)
library(ggplot2)
library(stringr)
library(rtweet)
library(quantmod)
library(SentimentAnalysis)
library(gridExtra)
library(data.table)
First, a connection is created with the Twitter API using rtweet.
## Creating the connection with rtweet
create_token(
app = "607",
consumer_key = "***********",
consumer_secret = "***********************",
access_token = "***************************",
access_secret = "*************************"
)
Then we obtain the Amazon Twitter data. We created a loop that iterates through each day to collect tweets, so that our dataset is not dominated by recent tweets. We also use retryonratelimit, as the free API limits the number of requests (https://developer.twitter.com/en/docs/basics/rate-limits). We have excluded all retweets from this dataset.
We get about 90 variables in the Twitter dataset. We will only be using the date and text columns for our analysis.
We want to store this dataset as a CSV so we can read it in for analysis later. For this purpose, we encode the text column to UTF-8 and convert the date using as.Date.
#dates vector increasing by one day at a time; we use this to get an adequate number of tweets for every day. Without this step, the data is dominated by recent tweets
dates <- seq.Date(from = as.Date("2019-12-01"), to = as.Date("2019-12-06"), by = 1)
#Empty dataframe created
df_amzn <- data.frame()
#using a loop to pull each day's tweets into a temporary dataframe, then binding it to the Amazon dataframe
for (i in seq_along(dates)) {
  df_temp <- search_tweets("$AMZN OR @AMZNNews", n = 15000,
                           lang = 'en', include_rts = FALSE,
                           #restrict each pass to tweets created before the next day via the API's 'until' query parameter (passed through by rtweet)
                           until = as.character(dates[i] + 1),
                           retryonratelimit = TRUE)
  df_amzn <- rbind(df_amzn, df_temp)
}
amzn<-df_amzn
#Encode to utf8
amzn$text <- enc2utf8(amzn$text)
#Extract the date and convert using as.Date; the extracted string is already in "%Y-%m-%d" format. Since we will compare daily data, we do not need the hour and minute parts.
date <- amzn$created_at
date <- str_extract(date, "\\d{4}-\\d{2}-\\d{2}")
date <- as.Date(date)
amzn$Date <- date
#Keep only the columns we need
amzn<-subset(amzn, select=c(Date,text))
#Write as a CSV file; this has been uploaded to GitHub
write.csv(amzn, file ="amzn.csv", row.names = FALSE, fileEncoding="UTF-8")
To get the stock price data, we use quantmod with Yahoo Finance as the data source.
#setting dates on which we need the prices
start <- as.Date("2019-11-27")
end <- as.Date("2019-12-06")
getSymbols("AMZN", src = "yahoo", from = start, to = end)
## [1] "AMZN"
price_amzn<-as.data.frame(AMZN)
#Extracting the date from the xts index
price_amzn$Date<-as.Date(index(AMZN))
A lot of cleanup is required in the text column to make it ready for the sentiment analysis package. We cleaned up whitespace, removed punctuation, removed $AMZN (the cashtag we used to acquire tweets about the stock), removed emojis, and removed URLs. We then take a glimpse at the corpus and find it to be much cleaner.
Cleaning up Amazon Data
#read the file into R
amzn<-read.csv('https://raw.githubusercontent.com/zahirf/Data607/master/amzn.csv', stringsAsFactors = FALSE)
class(amzn$Date)
## [1] "character"
#date is read in as character, so we convert
amzn$Date<-as.Date(amzn$Date)
#clean up the text
amzn$text <- gsub("http.*", "", amzn$text)#remove url
amzn$text <- gsub("https.*", "", amzn$text)#remove url
amzn$text <- gsub("&", "&", amzn$text) #remove &
amzn$text <- gsub("$AMZN ", "", amzn$text)#remove handle
amzn$text <- gsub("^[[:space:]]*","",amzn$text) # Remove leading whitespaces
amzn$text <- gsub(" +"," ",amzn$text) #Remove extra whitespaces
amzn$text <- iconv(amzn$text, "latin1", "ASCII", sub="") # Remove emojis
amzn$text <- gsub("\\n", "", amzn$text) #Replace line breaks with ""
amzn$text <- gsub("[[:punct:]]","",amzn$text) # Remove punctuation
amzn$text <- gsub("^[0-9]*$","",amzn$text) # Remove punctuation
glimpse(amzn$text)
## chr [1:30106] "Amazon AMZNAmazon bullish for mondayLong or short it on WCX " ...
Stock Price
We would like to do some exploratory analysis on the stock price data to get an idea of the volatility during the research period.
Let us look at the summary of the AMZN stock price. We see that the mean adjusted closing price was 1779, with the price ranging from 1740 to 1819, a spread of roughly 4% (see the quick check after the summary).
summary(price_amzn)
## AMZN.Open AMZN.High AMZN.Low AMZN.Close
## Min. :1760 Min. :1764 Min. :1740 Min. :1740
## 1st Qu.:1766 1st Qu.:1777 1st Qu.:1750 1st Qu.:1763
## Median :1788 Median :1797 Median :1761 Median :1776
## Mean :1787 Mean :1797 Mean :1768 Mean :1779
## 3rd Qu.:1804 3rd Qu.:1820 3rd Qu.:1789 3rd Qu.:1796
## Max. :1818 Max. :1825 Max. :1801 Max. :1819
## AMZN.Volume AMZN.Adjusted Date
## Min. :1923400 Min. :1740 Min. :2019-11-27
## 1st Qu.:2708525 1st Qu.:1763 1st Qu.:2019-11-29
## Median :2924700 Median :1776 Median :2019-12-02
## Mean :2958233 Mean :1779 Mean :2019-12-01
## 3rd Qu.:3292075 3rd Qu.:1796 3rd Qu.:2019-12-03
## Max. :3925600 Max. :1819 Max. :2019-12-05
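As a quick sanity check on that spread (a sketch, not part of the original report), the peak-to-trough move in the adjusted close can be computed directly; with the rounded summary values this is (1819 - 1740) / 1819, about 4.3%.
#peak-to-trough move in the adjusted closing price
rng <- range(price_amzn$AMZN.Adjusted)
(rng[2] - rng[1]) / rng[2]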
We now look at the price and volume trends using a candlestick chart. This is a handy tool, as it shows each day's volatility. For AMZN, we see that the highest volatility in prices was on 5 December, the last trading day in our data. A red bar means that the closing price was lower than the opening price on that day.
candleChart(AMZN,theme = chartTheme("white",up.col='blue',dn.col='red'))
Sentiment Analysis
We use the SentimentAnalysis package to calculate sentiment scores for the Amazon tweets. The package scores text against four dictionaries: GI (Harvard General Inquirer), HE (Henry), LM (Loughran-McDonald), and QDAP. Each dictionary returns a positivity score, a negativity score, and a net score, which is the difference between the two. After calculating column sums, we find that GI and QDAP have picked up substantially more sentiment words than the other two (their column sums are several times larger than those for HE and LM). We decide to use the GI scores for our analysis.
#run sentiment analysis
sentiment_amzn <- analyzeSentiment(amzn$text,
language = "english",
removeStopwords = TRUE, stemming = TRUE)
#calculate colSum to decide on dictionary
colSums(sentiment_amzn, na.rm=TRUE)
## WordCount SentimentGI NegativityGI
## 400645.0000 1508.1106 2189.3049
## PositivityGI SentimentHE NegativityHE
## 3697.4155 271.4827 200.5177
## PositivityHE SentimentLM NegativityLM
## 472.0004 -338.3169 905.8841
## PositivityLM RatioUncertaintyLM SentimentQDAP
## 567.5672 225.5134 1168.1036
## NegativityQDAP PositivityQDAP
## 1383.9282 2552.0318
Let us now build a new dataframe containing the columns we need: the date, the word count, and the GI sentiment score.
#Build the final dataframe for analysis. Note that cbind() coerces the Date column to numeric; we convert it back below.
df_amzn<-cbind(amzn$Date, sentiment_amzn$WordCount, sentiment_amzn$SentimentGI)
df_amzn<-as.data.frame(df_amzn)
#remove NA values
df_amzn <- df_amzn[complete.cases(df_amzn), ]
#rename columns after cbind
colnames(df_amzn)[1:3]<-c("Date", "Count", "MeanSentiment")
#restore the Date class (an origin is required when converting from numeric)
df_amzn$Date <- as.Date(df_amzn$Date, origin = "1970-01-01")
#Frequency of sentiment words by date
count_amzn<-df_amzn%>%
group_by(Date)%>%
summarise(Count = sum(Count))
#Mean sentiments by date
mean_amzn<-df_amzn%>%
group_by(Date)%>%
summarise(Mean = mean(MeanSentiment))
Visualizing the variables together
In order to plot both variables on the same chart, we need the daily returns of the stock prices. These have already been calculated in Excel, and we import the CSV here.
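For reference, here is a minimal sketch (not part of the original workflow, which used Excel) of how the same daily returns could be computed directly in R from the AMZN xts object created by getSymbols() above; quantmod's dailyReturn() computes period returns from a price series.
#simple daily returns from the adjusted close; the Excel-derived CSV is still used below
ret_amzn <- dailyReturn(Ad(AMZN), type = "arithmetic")
head(ret_amzn)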
ret<-read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/returns.csv", stringsAsFactors = FALSE)
mean_amzn<-cbind(mean_amzn, ret$AMZN, count_amzn$Count)
#rename the cbind-generated columns
colnames(mean_amzn)[3:4] <- c("Returns", "Count")
After plotting the variables, we do not see any clear trends between the stock returns and the other two variables.
plot1 <- ggplot(data=mean_amzn, aes(x=as.Date(Date),y=Returns, group=1)) +
geom_line()+
geom_point() +
ylab("Closing stock price")+
xlab("Date")
plot2 <- ggplot(data=mean_amzn, aes(x=as.Date(Date),y=Mean, group=1)) +
geom_line()+
geom_point() +
ylab("Mean Score")+
xlab("Date")
plot3<- ggplot(data=mean_amzn, aes(x=as.Date(Date),y=Count, group=1)) +
geom_line()+
geom_point() +
ylab("No of sentiments")+
xlab("Date")
grid.arrange(plot1, plot2,plot3, nrow=3)
We now run a linear regression model to estimate whether the sentiment scores can be used to predict AMZN returns.
We see that the adjusted R-squared is very low and the p-values are high.
reg<-lm(Returns~Mean, mean_amzn)
summary(reg)
##
## Call:
## lm(formula = Returns ~ Mean, data = mean_amzn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.008492 -0.007187 -0.003815 0.007185 0.015218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01117 0.02243 -0.498 0.632
## Mean 0.14747 0.41450 0.356 0.731
##
## Residual standard error: 0.01006 on 8 degrees of freedom
## Multiple R-squared: 0.01558, Adjusted R-squared: -0.1075
## F-statistic: 0.1266 on 1 and 8 DF, p-value: 0.7312
cor(mean_amzn$Returns,mean_amzn$Mean)
## [1] 0.1248046
ggplot(mean_amzn, aes(x=Mean, y=Returns))+
geom_point()+
geom_smooth(method='lm')
Let us run the same regression on the frequency of sentiment words. Again we have a high p-value and a low R-squared.
reg1<-lm(Returns~Count, mean_amzn)
summary(reg1)
##
## Call:
## lm(formula = Returns ~ Count, data = mean_amzn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.010675 -0.006610 -0.001305 0.008764 0.011919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.467e-03 6.803e-03 0.510 0.624
## Count -1.682e-07 1.526e-07 -1.103 0.302
##
## Residual standard error: 0.009444 on 8 degrees of freedom
## Multiple R-squared: 0.1319, Adjusted R-squared: 0.02341
## F-statistic: 1.216 on 1 and 8 DF, p-value: 0.3023
cor(mean_amzn$Returns,mean_amzn$Count)
## [1] -0.3632094
ggplot(mean_amzn, aes(x=Count, y=Returns))+
geom_point()+
geom_smooth(method='lm')
Let us now visualize the relationship between these variables for a second stock. The data were collected and saved to a CSV file to conserve memory on the local machine; the code can be found here: http://rpubs.com/zahirf/557769
Again there is no clear trend.
apple<-read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/data_aapl.csv", stringsAsFactors = FALSE)
glimpse(apple)
## Observations: 10
## Variables: 4
## $ Date <int> 18227, 18228, 18229, 18230, 18231, 18232, 18233, 18234...
## $ Mean <dbl> 0.07769088, 0.07072368, 0.06571325, 0.06974535, 0.0698...
## $ Returns <dbl> 0.010758164, 0.010758164, -0.010706638, -0.010706638, ...
## $ Count <int> 4353, 27429, 37362, 15906, 22338, 67002, 65028, 59630,...
#Date was saved as an integer day count, so an origin is needed to convert it back
apple$Date<-as.Date(apple$Date, origin = "1970-01-01")
plot4 <- ggplot(data=apple, aes(x=as.Date(Date),y=Returns, group=1)) +
geom_line()+
geom_point() +
ylab("Stock Returns")+
xlab("Date")
plot5 <- ggplot(data=apple, aes(x=as.Date(Date),y=Mean, group=1)) +
geom_line()+
geom_point() +
ylab("Mean Scores")+
xlab("Date")
plot6<- ggplot(data=apple, aes(x=as.Date(Date),y=Count, group=1)) +
geom_line()+
geom_point() +
ylab("No. of sentiment words")+
xlab("Date")
grid.arrange(plot4, plot5,plot6, nrow=3)
Let us run a linear model on this. We see a slightly higher correlation of around 21%, but still not enough to support a model that can satisfactorily predict stock returns.
reg2<-lm(Returns~Mean, apple)
summary(reg2)
##
## Call:
## lm(formula = Returns ~ Mean, data = apple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.020465 -0.009864 0.002970 0.008305 0.019198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01430 0.01995 -0.717 0.494
## Mean 0.19577 0.32670 0.599 0.566
##
## Residual standard error: 0.0129 on 8 degrees of freedom
## Multiple R-squared: 0.04296, Adjusted R-squared: -0.07667
## F-statistic: 0.3591 on 1 and 8 DF, p-value: 0.5656
cor(apple$Returns,apple$Mean)
## [1] 0.207262
ggplot(apple, aes(x=Mean, y=Returns))+
geom_point()+
geom_smooth(method='lm')
Let us test the model with the count of sentiment words. Again, the p-value is very high.
reg3<-lm(Returns~Count, apple)
summary(reg3)
##
## Call:
## lm(formula = Returns ~ Count, data = apple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.022007 -0.008569 0.000511 0.009926 0.018775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.453e-03 8.668e-03 -0.168 0.871
## Count -2.626e-08 1.749e-07 -0.150 0.884
##
## Residual standard error: 0.01316 on 8 degrees of freedom
## Multiple R-squared: 0.002811, Adjusted R-squared: -0.1218
## F-statistic: 0.02255 on 1 and 8 DF, p-value: 0.8843
cor(apple$Returns,apple$Count)
## [1] -0.05301883
ggplot(apple, aes(x=Count, y=Returns))+
geom_point()+
geom_smooth(method='lm')
The code used to create the MSFT CSV file is here: http://rpubs.com/zahirf/557775
We skip the visualization for MSFT and only run the regressions; the results are shown in the conclusions.
microsoft<-read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/data_msft.csv", stringsAsFactors = FALSE)
reg4<-lm(Returns~Mean, microsoft)
reg5<-lm(Returns~Count, microsoft)
Let us summarize the findings so far:
Using stock returns and mean sentiment scores, we see that the R-squared values are very low and the p-values are very high. The variables do not have a statistically significant relationship.
meanSentiment<-data.frame("Ticker"=c("AMZN","AAPL", "MSFT"),
"RSquare"=c(summary(reg)$r.squared,summary(reg2)$r.squared,
summary(reg4)$r.squared),
"PValue"=c(anova(reg)$'Pr(>F)'[1],anova(reg2)$'Pr(>F)'[1],
anova(reg4)$'Pr(>F)'[1]))
meanSentiment
## Ticker RSquare PValue
## 1 AMZN 0.015576185 0.7312028
## 2 AAPL 0.042957536 0.5655938
## 3 MSFT 0.009851615 0.7850054
Using the count of sentiment words, we see the statistics improving a little for MSFT, but the relationships are still statistically insignificant.
countSentiment<-data.frame("Ticker"=c("AMZN","AAPL", "MSFT"),
"RSquare"=c(summary(reg1)$r.squared,summary(reg3)$r.squared,
summary(reg5)$r.squared),
"PValue"=c(anova(reg1)$'Pr(>F)'[1],anova(reg3)$'Pr(>F)'[1],
anova(reg5)$'Pr(>F)'[1]))
countSentiment
## Ticker RSquare PValue
## 1 AMZN 0.131921092 0.30225767
## 2 AAPL 0.002810996 0.88434678
## 3 MSFT 0.346724401 0.07329421
We may therefore conclude that, in this sample, neither Twitter sentiment scores nor counts of sentiment words can be used to predict stock returns in the short run.
However, the dataset was limited to only 10 days; it would be interesting to repeat the analysis with a sample of at least 30 returns and with coverage of more stocks in the same sector.