Our research question is as follows: can sentiment analysis of relevant Tweets be a predictor for market movement in Bitcoin, measured with a variety of metrics including price change and volatility? Our results could also inform prediction of movement in other asset prices based on their mentions on Twitter or other microblogging platforms.
There is a significant body of literature that scrapes text, performs sentiment analysis on it, and uses the results to predict market movement in an asset price. This literature draws on a number of text sources, such as newspapers, Twitter, press releases, and other microblogging or periodical data, and spans a range of assets, including house prices, stock index prices, and individual asset prices. Because only a small portion of this literature addresses our specific question, namely sentiment analysis of Tweet data to predict Bitcoin prices, that is the work we focused on.
We selected three recent studies of this question, which we briefly discuss below. Kaminski uses a Twitter API to scrape Bitcoin-related Tweets from 2013-2014 and performs sentiment analysis on them to predict Bitcoin prices, collected on a daily basis. They found a significant relationship between negative Tweets and lower Bitcoin close prices, and between high trading volume and high emotion, but they also found that fluctuations in Bitcoin preceded Twitter sentiment by approximately one day. Ultimately they conclude that Bitcoin market movement is reflected emotionally on Twitter the day after the movement. They also acknowledge that their data spanned a mere 104 days, which could be insufficient for this analysis. We also find their sentiment analysis problematic, as it uses classically positive and negative words to determine sentiment, like “happy, good, and love” and “bad, upset and unhappy.” While they do have a category of Tweets that contain “hope, fear, or worry,” we think these emotional indicators are poor reflectors of bullish or bearish tendencies on an asset, which are better captured by words like “performance, buy, upswing, and rally” or “underperform, sell, plummet, and collapse.”
Stenqvist and Lönnö classify Twitter sentiment on a positive/negative binary and attempt to use the results to predict short-term Bitcoin price increases and decreases over specific, short time intervals, disregarding volume traded and other metrics, over 31 days from May to June 2017. Like our dataset, they use VADER sentiment analysis, recording the polarity and intensity of emotion in each Tweet. They conclude that though there may be a partial correlation between Tweets and Bitcoin prices in certain subsets of the data, overall there is no significant correlation: their results show a prediction accuracy near 50% on a binary output variable of price up or down. However, they admit that their research design suffers from a lack of data and a lack of domain-specific sentiment analysis.
McNally, Roche and Caton analyze Bitcoin price data, smoothed with 5- and 10-day simple moving averages, from August 2013 to July 2016 using two nonlinear deep learning methods: a recurrent neural network (RNN) and Long Short-Term Memory (LSTM). Though the deep learning models outperform their benchmarks, they are still not good predictors, as the variance of the output variable makes prediction difficult.
We found our dataset on Kaggle; it contains hourly Bitcoin market values alongside Twitter Bitcoin sentiment. The sentiment analysis had already been run and the Bitcoin data aggregated by a third party. We ran correlation analysis to select variables and transformed some significant variables by converting their raw values into percentage measures. From there we regress Bitcoin market data on the Twitter sentiment data to assess any correlations or patterns that may exist among significant variables in either category. We considered the models discussed in class and found linear regression to be the best fit for our available data.
The Kaggle dataset, insofar as a third party collected it, came with several limitations for our analysis. First, the sentiment analysis was performed using the VADER method, which helpfully records not only the polarity of the text in question (positive or negative), but also the intensity of the sentiment (“great” > “good”). However, it analyzes text against a generic vocabulary of positive and negative words rather than a domain-specific one, which is problematic for this kind of price analysis. For example, although “happy” reads as positive, someone who is bullish on Bitcoin is much more likely to use the words “buy,” “long,” or “hold,” none of which would read as positive under VADER. Second, the Volume data in this dataset, the amount of Bitcoin traded on an hourly basis, was either wrong or unreadable, which we discovered by checking it against Bitcoin historical trading data. In addition, there were stretches of multiple hours where Twitter data was unavailable for no known reason.
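To illustrate the kind of domain-specific scoring we would have preferred, below is a minimal sketch in R; the bullish/bearish word lists and the score_tweet() helper are our own illustrative assumptions, not part of VADER or of our dataset.
# Illustrative domain-specific lexicon: net bullish-minus-bearish word count
bullish <- c("buy", "long", "hold", "rally", "upswing")
bearish <- c("sell", "short", "dump", "plummet", "collapse")
score_tweet <- function(text) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  sum(words %in% bullish) - sum(words %in% bearish)
}
score_tweet("Time to buy and hold")     # 2: reads bullish
score_tweet("BTC will plummet, sell!")  # -2: reads bearish
A real lexicon would also carry intensity weights, as VADER does for its generic vocabulary, but even this crude count captures trading language that VADER scores as neutral.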
# Assumed packages: tidyverse (dplyr/ggplot2), lubridate (dates), Metrics (mse), class (knn)
library(tidyverse); library(lubridate); library(Metrics); library(class)
allData <- read.csv(file = "Data_To_Hourervals_no_filter.csv",
header = TRUE, sep = ";", strip.white = TRUE)
allData$Date <- dmy_hm(allData$Date)  # parse day-month-year hour:minute timestamps
We had Bitcoin data on Open, High, Low, and Close values for every hour spanned by our dataset. We also had data on Bitcoin volume in both BTC and dollars, but because we found this data to be unreadable or incorrect, we decided not to use it in our analysis.
We had data on the number of positive, negative, neutral, and total Tweets about Bitcoin within each hour, as well as the average negative sentiment intensity of each hour’s negative Tweets and the average positive sentiment intensity of each hour’s positive Tweets. There was also a Compound Score variable that aggregated the sentiment of all Tweets in a given hour and divided it by the total number of Tweets, but this value was not very useful, as it was often very small, diluted by the large number of neutral Tweets.
We then prepared the following subsets of variables for use in linear regression analysis.
twitter <- select(allData, c("Date", "n", "Count_Negatives", "Count_Positives",
"Count_Neutrals", "Sent_Negatives", "Sent_Positives", "Compound_Score"))
# Create additional variables to run analysis for twitter
twitter <- mutate(twitter, sentPercTotal = (((Count_Negatives * Sent_Negatives) +
(Count_Positives * Sent_Positives)) / n)) %>%
mutate(sentPerc = (((Count_Negatives * Sent_Negatives) + (Count_Positives * Sent_Positives)) /
(Count_Positives + Count_Negatives))) %>%
mutate(Spread_Sent = Sent_Positives - Sent_Negatives)
# Create additional variables to run analysis for bitcoin
bitcoin <- select(allData, c("Date", "Open", "High", "Low", "Close"))
bitcoin <- mutate(bitcoin, Change = Close-Open) %>%
mutate(Change_Sign = if_else(condition = Change>0, true = 1, false = 0)) %>%
mutate(Perc_Change = (High/Low) - 1)
# Build first iteration of data
data <- full_join(twitter, bitcoin, by = "Date")
data1 <- data %>%
na.omit()
# Remove unnecessary data sets
rm(twitter, bitcoin)
We created additional variables to facilitate our analysis. Change is the difference between the closing and opening value of Bitcoin for that hour. Change_Sign is a boolean that is 1 if Change is positive and 0 if Change is negative, that is, whether the Bitcoin price went up or down during that hour. Perc_Change shows the percentage difference between Bitcoin’s high and low price within a given hour.
In sentPercTotal, we netted the total sentiment in the negative and positive Tweets from each hour and divided it by the total number of Tweets (positive, negative, and neutral), giving an average net sentiment score. In sentPerc, we divided the same netted sentiment by the number of sentimental Tweets only (positive and negative), giving an average net sentiment per sentimental Tweet. This is a more concentrated version of sentPercTotal, as it omits neutral Tweets. Spread_Sent measures the distance between the average positive and negative sentiment intensities for each hour. The formulas are written out below.
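Mirroring the code above, the derived measures are \(Change = Close - Open\), \(Perc\_Change = \frac{High}{Low} - 1\), \(sentPercTotal = \frac{Count_{Neg} \cdot Sent_{Neg} + Count_{Pos} \cdot Sent_{Pos}}{n}\), \(sentPerc = \frac{Count_{Neg} \cdot Sent_{Neg} + Count_{Pos} \cdot Sent_{Pos}}{Count_{Pos} + Count_{Neg}}\), and \(Spread\_Sent = Sent_{Pos} - Sent_{Neg}\).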
Since not all Twitter data was available for each hour for unknown reasons, we omitted observations with NAs recorded in any of our variable columns.
Next, we selected the variables for the correlation heatmap.
selectedData <- c("Perc_Change", "n", "Count_Negatives", "Count_Positives", "Count_Neutrals", "Sent_Negatives", "Sent_Positives", "Compound_Score", "sentPerc", "sentPercTotal", "Spread_Sent", "Change")
# Build second iteration of data
data2 <- select(data, selectedData) %>%
na.omit()
Basic Correlation Analysis and Plotting
## Plot of Bitcoin closing price over time
ggplot(data, aes(x = Date, y = Close)) +
geom_path() +
ylab("Closing Price")
## Plot of Price Volatility by Hour
ggplot(data, aes(x = Date, y = Change)) +
geom_path() +
ylab("Close$ - Open$")
ggplot(data, aes(x = Date, y = Perc_Change)) +
geom_path() +
ylab("Perc_Change = (High - Low) / 1")
## Plot of hourly Tweet volume
ggplot(data, aes(x = Date, y = n)) +
geom_path() +
ylab("Tweet Volume")
## Plot of number of tweets
ggplot(data = data, aes(x = Date)) +
geom_smooth(aes(y = Count_Neutrals, color = "Neutral")) +
geom_smooth(aes(y = Count_Negatives, color = "Negative")) +
geom_smooth(aes(y = Count_Positives, color = "Positive")) +
labs(color = NULL, y = "Tweet Volume")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(data = data, aes(x = Date)) +
geom_col(aes(y = Close, color = "Price")) +
geom_col(aes(y = n, color = "Tweet Volume")) +
labs(color = NULL)
These plots show, respectively, Bitcoin price over time, hourly Change (as described above), Perc_Change, Tweet volume, Tweet volume separated into positive, negative, and neutral Tweets, and Bitcoin price overlaid with Bitcoin Tweet volume. The plot of Change shows that Bitcoin, like many assets, moves in a “random walk” pattern, with the price moving up and down almost randomly on a day-to-day or hour-to-hour basis. There is a large spike in Tweets, in Bitcoin price and volatility, and in all three categories of sentimental Tweets around the new year of 2018. Our last plot, showing Bitcoin price and Tweet volume together, suggests a relationship between these two variables: both have peaks near the end of 2017, taper off through most of 2018, and fluctuate less during the rest of the dataset.
Plot heatmap of pair-wise correlations.
require(reshape2)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
mat <- cor(data2)
meltedmat <- melt(mat)
heatmap1 <- ggplot(data = meltedmat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "dodgerblue", mid = "white", midpoint = 0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
heatmap1
The Bitcoin variables are highly correlated with each other, and the Twitter variables are all highly correlated with each other, which makes sense: the high and low Bitcoin prices within a single hour are unlikely to be vastly different, and the count of negative-sentiment Tweets rises with the counts of the other Tweet types, since a higher overall volume of Tweets implies more negative Tweets.
What is interesting, however, is that Count_Negatives and Count_Neutrals have higher correlations with sentPerc and sentPercTotal than the other Twitter variables do, indicating that they may be significant for our models as we do our data analysis. Spread_Sent, which we initially thought would be indicative of volatility, does not show strong relationships with any of the Bitcoin variables; it has only a mild negative relationship with Count_Neutrals.
What this model really seeks to explain is Perc_Change, especially as Change does not correlate strongly with any variables. The variables most highly correlated with Perc_Change are sentPerc, which seems to do a little better than sentPercTotal because it is less diluted, and the Twitter count data, specifically Count_Negatives and Count_Neutrals. n is highly correlated with Perc_Change as well, but we think the neutral and negative counts capture much of n while separating out the individual effects of neutral and negative Tweets.
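The same ranking can be read off numerically by sorting the Perc_Change row of the correlation matrix computed above; this one-liner is a convenience view we add here, not part of the original analysis.
# Correlations with Perc_Change, strongest first
sort(mat["Perc_Change", ], decreasing = TRUE)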
Reprint pairwise correlation heatmap.
heatmap1
Run several linear regression models.
## Best model is mod1:
mod1 <- lm(Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals, data = data1)
summary(mod1)
##
## Call:
## lm(formula = Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.042958 -0.005446 -0.002120 0.002983 0.156754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.699e-03 4.556e-04 12.509 <2e-16 ***
## sentPerc -1.596e-02 1.730e-03 -9.228 <2e-16 ***
## Count_Negatives 2.174e-05 1.149e-06 18.921 <2e-16 ***
## Count_Neutrals 6.972e-06 5.235e-07 13.317 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009813 on 12353 degrees of freedom
## Multiple R-squared: 0.2968, Adjusted R-squared: 0.2966
## F-statistic: 1738 on 3 and 12353 DF, p-value: < 2.2e-16
confint(mod1)
## 2.5 % 97.5 %
## (Intercept) 4.806345e-03 6.592608e-03
## sentPerc -1.935157e-02 -1.257058e-02
## Count_Negatives 1.948441e-05 2.398792e-05
## Count_Neutrals 5.945601e-06 7.998045e-06
coefficients(mod1)
## (Intercept) sentPerc Count_Negatives Count_Neutrals
## 5.699476e-03 -1.596108e-02 2.173617e-05 6.971823e-06
## Worse Models
mod2 <- lm(Perc_Change ~ sentPercTotal + n, data = data1)
summary(mod2)
##
## Call:
## lm(formula = Perc_Change ~ sentPercTotal + n, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.035920 -0.005499 -0.002073 0.002969 0.157098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.018e-03 3.857e-04 20.79 <2e-16 ***
## sentPercTotal -5.175e-02 2.368e-03 -21.86 <2e-16 ***
## n 7.236e-06 1.247e-07 58.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009865 on 12354 degrees of freedom
## Multiple R-squared: 0.2892, Adjusted R-squared: 0.2891
## F-statistic: 2513 on 2 and 12354 DF, p-value: < 2.2e-16
mod3 <- lm(Perc_Change ~ sentPerc + n, data = data1)
summary(mod3)
##
## Call:
## lm(formula = Perc_Change ~ sentPerc + n, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.036106 -0.005506 -0.002048 0.002991 0.157016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.624e-03 4.023e-04 21.44 <2e-16 ***
## sentPerc -3.147e-02 1.406e-03 -22.39 <2e-16 ***
## n 7.174e-06 1.251e-07 57.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009856 on 12354 degrees of freedom
## Multiple R-squared: 0.2905, Adjusted R-squared: 0.2904
## F-statistic: 2529 on 2 and 12354 DF, p-value: < 2.2e-16
mod4 <- lm(Perc_Change ~ Spread_Sent, data = data1)
summary(mod4)
##
## Call:
## lm(formula = Perc_Change ~ Spread_Sent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.014437 -0.007322 -0.003475 0.003168 0.156678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.031632 0.001798 17.59 <2e-16 ***
## Spread_Sent -0.021785 0.002018 -10.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01165 on 12355 degrees of freedom
## Multiple R-squared: 0.009343, Adjusted R-squared: 0.009263
## F-statistic: 116.5 on 1 and 12355 DF, p-value: < 2.2e-16
mod5 <- lm(Perc_Change ~ Count_Neutrals + Count_Negatives, data = data1)
summary(mod5)
##
## Call:
## lm(formula = Perc_Change ~ Count_Neutrals + Count_Negatives,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.047916 -0.005451 -0.002246 0.002984 0.156908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.876e-03 1.901e-04 9.867 <2e-16 ***
## Count_Neutrals 5.064e-06 4.826e-07 10.493 <2e-16 ***
## Count_Negatives 2.810e-05 9.216e-07 30.495 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009846 on 12354 degrees of freedom
## Multiple R-squared: 0.2919, Adjusted R-squared: 0.2918
## F-statistic: 2547 on 2 and 12354 DF, p-value: < 2.2e-16
Selected function: \(PercChange = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \varepsilon\) where \(X_1 = SentimentPercent\), \(X_2 = CountNegatives\), and \(X_3 = CountNeutrals\)
We chose a linear model because we have a limited number of variables and our output variable, Perc_Change, is continuous. In addition, our transformation of Bitcoin price fluctuations into a percentage measure helps normalize the data, and it is plausible that the percentage change in sentiment has a linear relationship to the percentage change in Bitcoin’s price. We tested several models using the features the heatmap indicated were most correlated with Perc_Change; above are a few combinations of sentPerc, sentPercTotal, n, Count_Negatives, Count_Neutrals, and Spread_Sent. The best-performing model regresses Perc_Change on sentPerc, Count_Negatives, and Count_Neutrals, which makes sense: sentPerc is a more concentrated measure of sentiment than sentPercTotal, and n is explained in part by Count_Negatives and Count_Neutrals. It also makes sense in the real world. Negative sentiment is more significant and attention-grabbing than positive sentiment, and neutral Tweets serve as a useful proxy for n, since more Tweets about Bitcoin indicate more chatter and discussion of the asset, which could create price spikes, plunges, or general volatility, influencing Perc_Change.
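Substituting the estimates printed for mod1 above gives the fitted equation \(\widehat{PercChange} = 0.00570 - 0.01596\,SentimentPercent + 2.174 \times 10^{-5}\,CountNegatives + 6.972 \times 10^{-6}\,CountNeutrals\). As a hypothetical worked example, an hour with sentPerc = 0.2, 300 negative Tweets, and 1,000 neutral Tweets would be predicted to swing about \(0.0057 - 0.0032 + 0.0065 + 0.0070 \approx 0.016\), or 1.6%, between its hourly high and low.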
Using selected function \(PercChange = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \varepsilon\).
Predict test data based on trained Model 1: train on the first half of the timeline and test on the second half.
# Split into test and train data
set.seed(123)
train1 <- 1:floor(nrow(data1) / 2)  # first half of the 12,357 rows
trainData1 <- data1[train1, ]
testData1 <- data1[-train1, ]
# Build linear regression Model 1 and review
trainedModel1 <- lm(Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals, trainData1)
summary(trainedModel1)
##
## Call:
## lm(formula = Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals,
## data = trainData1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.040311 -0.006752 -0.002210 0.004003 0.154001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.216e-03 7.103e-04 12.974 < 2e-16 ***
## sentPerc -1.023e-02 3.088e-03 -3.314 0.000924 ***
## Count_Negatives 2.626e-05 1.603e-06 16.378 < 2e-16 ***
## Count_Neutrals 1.634e-06 7.968e-07 2.051 0.040353 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0114 on 6174 degrees of freedom
## Multiple R-squared: 0.238, Adjusted R-squared: 0.2376
## F-statistic: 642.8 on 3 and 6174 DF, p-value: < 2.2e-16
# Obtain mean squared error for Model 1
testModel1 <- predict(trainedModel1, testData1[c("sentPerc", "Count_Negatives", "Count_Neutrals")])
testData1 <- mutate(testData1, Model_Value = testModel1*100) %>%
mutate(Perc_Change = Perc_Change*100)
mse(actual = testData1$Perc_Change, predicted = testModel1)
## [1] 1.119214
ggplot(data = testData1, aes(x = Date)) +
geom_path(aes(y = Perc_Change, color = "Actual")) +
geom_path(aes(y = Model_Value, color = "Model")) +
labs(color = NULL)
ggplot(data = testData1, aes(x = Date)) +
geom_smooth(aes(y = Perc_Change, color = "Actual")) +
geom_smooth(aes(y = Model_Value, color = "Model")) +
labs(color = NULL)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We took the linear regression specification that performed best in the previous question and used it to train and test a model above, training on the first half of the data and testing on the second half. As our plots show, we captured only a fraction of the variation in Perc_Change. This may be because the model is trained on the first half of the data, which contains the large spike in volatility and price near the end of 2017; trained on such volatile data, it may not pick up on less intense fluctuations in sentiment and price. Our plots bear this out: in the last smoothed plot, our model overestimates Perc_Change and stays too flat, understating how much Perc_Change varies over the test period. Though our p-value is quite low, rejecting the null hypothesis that sentPerc, Count_Negatives, and Count_Neutrals have no relationship with Perc_Change, our Adjusted R-squared is 0.2376, meaning the model explains only about 24% of the variation in Perc_Change.
Predict test data based on trained Model 2: train on a random 80% of the data and test on the remaining 20%.
# Split into test and train data
set.seed(400)
train2 <- sample(1:nrow(data1), size = nrow(data1)*.8)
trainData2 <- data1[train2, ]
testData2 <- data1[-train2, ]
# Build linear regression Model 2 and review
trainedModel2 <- lm(Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals, trainData2)
summary(trainedModel2)
##
## Call:
## lm(formula = Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals,
## data = trainData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.042158 -0.005432 -0.002071 0.002978 0.156789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.802e-03 5.099e-04 11.378 <2e-16 ***
## sentPerc -1.646e-02 1.928e-03 -8.536 <2e-16 ***
## Count_Negatives 2.079e-05 1.290e-06 16.120 <2e-16 ***
## Count_Neutrals 7.328e-06 5.820e-07 12.592 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009786 on 9881 degrees of freedom
## Multiple R-squared: 0.295, Adjusted R-squared: 0.2948
## F-statistic: 1378 on 3 and 9881 DF, p-value: < 2.2e-16
# Obtain mean squared error for Model 2
testModel2 <- predict(trainedModel2, testData2[c("sentPerc", "Count_Negatives", "Count_Neutrals")])
testData2 <- mutate(testData2, Model_Value = testModel2*100) %>%
mutate(Perc_Change = Perc_Change*100)
mse(actual = testData2$Perc_Change, predicted = testModel2)
## [1] 2.837414
ggplot(data = testData2, aes(x = Date)) +
geom_path(aes(y = Perc_Change, color = "Actual")) +
geom_path(aes(y = Model_Value, color = "Model")) +
labs(color = NULL)
ggplot(data = testData2, aes(x = Date)) +
geom_smooth(aes(y = Perc_Change, color = "Actual")) +
geom_smooth(aes(y = Model_Value, color = "Model")) +
labs(color = NULL)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Training our model on a random 80% of the data and testing it on the remaining 20% yields a much better fit. Not only does the model better reflect the actual Perc_Change magnitudes, it also follows the trends more closely and does not overestimate the actual values like the previous model. It still isn’t perfect: the smoothed plot shows the model’s predictions are much flatter than the actual Perc_Change values. The p-value is similarly low for this model, but Adjusted R-squared is a little higher at 0.2948, reflecting that this model better describes the data.
Aggregate all data by day.
top <- head(allData$Date, n = 1)
bottom <- tail(allData$Date, n = 1)
intrvl <- seq(date(top), date(bottom), by = as.difftime(days(1)))
buildMatrix <- matrix(nrow = length(intrvl), ncol = 11)
colnames(buildMatrix) <- c("Date","n", "Count_Negatives", "Count_Positives", "Count_Neutrals",
"Sent_Negatives", "Sent_Positives", "Open", "High", "Low", "Close")
# Build aggregated dataframe
for (d in 1:length(intrvl)) {
A <- filter(allData, date(Date) == intrvl[d]) %>%
na.omit()
buildMatrix[d, 1] <- toString(intrvl[d])
buildMatrix[d, 2] <- sum(A$n)
buildMatrix[d, 3] <- sum(A$Count_Negatives)
buildMatrix[d, 4] <- sum(A$Count_Positives)
buildMatrix[d, 5] <- sum(A$Count_Neutrals)
# Aggregation of sentiment (count-weighted); columns indexed by position in allData
value1 <- 0
value2 <- 0
for (j in 1:24) {  # assumes 24 hourly rows; days with missing hours yield NA and are dropped later
value1 <- value1 + (A[j, 5] * A[j, 8])
value2 <- value2 + (A[j, 4] * A[j, 7])
}
buildMatrix[d, 7] <- value1/sum(A$Count_Positives)
buildMatrix[d, 6] <- value2/sum(A$Count_Negatives)
B <- filter(allData, date(Date) == intrvl[d])
buildMatrix[d, 11] <- B[24, 12]
buildMatrix[d, 10] <- min(B[, 11])
buildMatrix[d, 9] <- max(B[, 10])
buildMatrix[d, 8] <- B[1, 9]
}
# Finishing touches to clean data
dataDays <- select(as.data.frame(buildMatrix, stringsAsFactors = FALSE),
c("Date","n", "Count_Negatives", "Count_Positives",
"Count_Neutrals", "Sent_Negatives", "Sent_Positives",
"Open", "High", "Low", "Close"))
toNumeric <- c("n", "Count_Negatives", "Count_Positives", "Count_Neutrals",
"Sent_Negatives", "Sent_Positives", "Open", "High", "Low", "Close")
options(digits = 8)
dataDays[toNumeric] <- sapply(dataDays[toNumeric], as.double)
dataDays$Date <- ymd(dataDays$Date)
This section of code aggregates sentiment and Bitcoin price so we can analyze them on a daily level rather than on an hourly basis.
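The same daily roll-up can be expressed more idiomatically with dplyr. The sketch below assumes the column names used above and is not the code that produced our results; in particular, the loop above takes Open, High, Low, and Close from the unfiltered hours while dropping NA hours for the sentiment sums, so the two versions can differ slightly.
# Sketch: equivalent daily aggregation via group_by/summarize
dataDaysAlt <- allData %>%
na.omit() %>%
group_by(Day = date(Date)) %>%
summarize(
# count-weighted daily mean intensities, computed before the counts are collapsed
Sent_Negatives = sum(Count_Negatives * Sent_Negatives) / sum(Count_Negatives),
Sent_Positives = sum(Count_Positives * Sent_Positives) / sum(Count_Positives),
n = sum(n),
Count_Negatives = sum(Count_Negatives),
Count_Positives = sum(Count_Positives),
Count_Neutrals = sum(Count_Neutrals),
Open = first(Open), High = max(High), Low = min(Low), Close = last(Close)
)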
Rough analysis of aggregated variables.
data3 <- select(dataDays, c("n", "Count_Negatives", "Count_Positives", "Count_Neutrals", "Sent_Negatives",
"Sent_Positives", "Open", "High", "Low", "Close")) %>%
na.omit() %>%
mutate(Change = Close-Open) %>%
mutate(Perc_Change = (High/Low)-1) %>%
mutate(sentPerc = (((Count_Negatives*Sent_Negatives)+(Count_Positives*Sent_Positives))/
(Count_Positives + Count_Negatives))) %>%
select(c("Perc_Change", "n", "Count_Negatives", "Count_Positives", "Count_Neutrals",
"Sent_Negatives", "Sent_Positives", "sentPerc", "Change"))
cor(data3)
## Perc_Change n Count_Negatives Count_Positives
## Perc_Change 1.00000000 0.644569321 0.6876803152 0.60507878
## n 0.64456932 1.000000000 0.9504117978 0.98230058
## Count_Negatives 0.68768032 0.950411798 1.0000000000 0.92762198
## Count_Positives 0.60507878 0.982300581 0.9276219817 1.00000000
## Count_Neutrals 0.61548208 0.978979215 0.8873335748 0.93703294
## Sent_Negatives -0.10654966 -0.135178821 -0.1984481855 -0.15224680
## Sent_Positives -0.29192894 -0.276667514 -0.2702568518 -0.18527922
## sentPerc -0.47377444 -0.381914979 -0.5593255593 -0.27648380
## Change -0.10922524 0.061069997 0.0089775011 0.06012198
## Count_Neutrals Sent_Negatives Sent_Positives sentPerc
## Perc_Change 0.615482079 -0.106549662 -0.291928944 -0.473774435
## n 0.978979215 -0.135178821 -0.276667514 -0.381914979
## Count_Negatives 0.887333575 -0.198448186 -0.270256852 -0.559325559
## Count_Positives 0.937032940 -0.152246804 -0.185279225 -0.276483796
## Count_Neutrals 1.000000000 -0.083017711 -0.337048695 -0.355829202
## Sent_Negatives -0.083017711 1.000000000 -0.244692612 0.250403764
## Sent_Positives -0.337048695 -0.244692612 1.000000000 0.574879577
## sentPerc -0.355829202 0.250403764 0.574879577 1.000000000
## Change 0.082256169 0.080466288 -0.048121273 0.077167637
## Change
## Perc_Change -0.1092252397
## n 0.0610699969
## Count_Negatives 0.0089775011
## Count_Positives 0.0601219801
## Count_Neutrals 0.0822561690
## Sent_Negatives 0.0804662880
## Sent_Positives -0.0481212734
## sentPerc 0.0771676373
## Change 1.0000000000
mat2 <- cor(data3)
meltedmat2 <- melt(mat2)
ggplot(data = meltedmat2, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
scale_fill_gradient2(low="red", high="forestgreen", mid = "white", midpoint = 0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
This heatmap is quite similar to the hourly heatmap; Perc_Change is still most correlated with Count_Negatives and most negatively correlated with sentPerc. Perc_Change still seems to be more useful than Change, and most other relationships are similar to the hourly heatmap.
Predict test data based on trained Model 3: train on the first half of the daily data and test on the second half.
set.seed(123)
train3 <- 1:(485/2)
trainData3 <- data3[train3, ]
testData3 <- data3[-train3, ]
trainedModel3 <- lm(Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals, trainData3)
summary(trainedModel3)
##
## Call:
## lm(formula = Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals,
## data = trainData3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0994948 -0.0266476 -0.0081957 0.0221842 0.1852532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7096e-02 1.8446e-02 3.0953 0.0022018 **
## sentPerc -1.5172e-01 8.7698e-02 -1.7301 0.0849106 .
## Count_Negatives 7.2160e-06 1.8630e-06 3.8733 0.0001388 ***
## Count_Neutrals 5.1144e-07 9.3538e-07 0.5468 0.5850448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.042821 on 238 degrees of freedom
## Multiple R-squared: 0.43202, Adjusted R-squared: 0.42486
## F-statistic: 60.344 on 3 and 238 DF, p-value: < 2.22e-16
testModel3 <- predict(trainedModel3,
testData3[c("sentPerc", "Count_Negatives", "Count_Neutrals")])
testData3 <- mutate(testData3, Model_Value = testModel3*100) %>%
mutate(Perc_Change = Perc_Change*100)
mse(actual = testData3$Perc_Change, predicted = testModel3)
## [1] 34.474181
# Quick base-R look at actual vs. modeled Perc_Change
plot(testData3$Perc_Change); plot(testData3$Model_Value)
We reuse our first split strategy (which performed poorly on the hourly data) to test the same linear model, regressing Perc_Change on sentPerc, Count_Negatives, and Count_Neutrals, on daily rather than hourly data. Though it has a low p-value and a relatively high Adjusted R-squared of 0.42486, it has quite a high mean squared error of 34.474. When plotted, the test model’s results are quite different from the actual data, so we are dubious about the usability of this model. We do think that aggregating by day makes the data more predictable, as daily data is slightly less volatile than hourly data, making the model less likely to fail from being trained on very different data.
Here we take the average net sentiment in a given hour and attempt to predict whether the Bitcoin price moves up or down. The data is split so that we train on the first half of the timeline and test on the second half.
# Split Change_Sign labels and the sentPercTotal feature into halves
train <- floor(nrow(data1) / 2)
targetTrain <- data1$Change_Sign[1:train]
targetTest <- data1$Change_Sign[-(1:train)]
A <- data1$sentPercTotal[1:train]    # training feature
B <- data1$sentPercTotal[-(1:train)] # test feature
set.seed(10)
knn10 <- knn(data.frame(A), data.frame(B), cl = targetTrain, k = 10)
knn100 <- knn(data.frame(A), data.frame(B), cl = targetTrain, k = 100)
knn300 <- knn(data.frame(A), data.frame(B), cl = targetTrain, k = 300)
accuracy10 <- sum(diag(table(knn10, targetTest))) / sum(table(knn10, targetTest))
accuracy100 <- sum(diag(table(knn100, targetTest))) / sum(table(knn100, targetTest))
accuracy300 <- sum(diag(table(knn300, targetTest))) / sum(table(knn300, targetTest))
accuracy10 * 100
## [1] 49.441657
accuracy100 * 100
## [1] 50.364137
accuracy300 * 100
## [1] 49.668231
The KNN model is no better than random prediction; in fact, the k = 10 model was accurate less than half the time.
Function: \(PercChange = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \varepsilon\) where \(X_1 = SentimentPercent\), \(X_2 = CountNegatives\), and \(X_3 = CountNeutrals\)
Run cross-validation with five rounds of random 80/20 splits.
# Clean data for cross-validation
dataCV <- select(data2, c("Perc_Change", "sentPerc", "Count_Negatives", "Count_Neutrals"))
set.seed(8)
cv.error <- rep(0, 5)
folds <- 1:5
# Run 5-fold CV
for (k in folds) {
train.on <- sample(1:nrow(dataCV), nrow(dataCV)*.8)
trainCV <- dataCV[train.on, ]
testCV <- dataCV[-train.on, ]
trainedModelCV <- lm(Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals, trainCV)
testModelCV <- predict(trainedModelCV, testCV[c("sentPerc", "Count_Negatives", "Count_Neutrals")])
testCV <- mutate(testCV, Model_Value = testModelCV*100) %>%  # store predictions, as in the test blocks above
mutate(Perc_Change = Perc_Change*100)
cv.error[k] <- mse(actual = testCV$Perc_Change, predicted = testModelCV)
}
# Determine CV Error (Average MSE)
mean(cv.error)
## [1] 2.7866292
We validated our model by running cross-validation with five rounds. We did this on our best specification (that of trainedModel2), the linear model trained on a random 80% of dates and tested on the remaining 20%. Note that our loop repeats a random 80/20 split five times (repeated holdout) rather than partitioning the data into five disjoint folds, but it serves the same validation purpose. The average mean squared error across the five runs (2.787) is slightly smaller than the single-split error from Model 2 (2.837), suggesting the model’s performance is stable rather than an artifact of one particular split.
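For comparison, a strict 5-fold partition with disjoint folds would look like the sketch below. It is illustrative only (and, unlike our loop, scores actual and predicted values in the same percentage units), so its error is not directly comparable to the number printed above.
# Sketch: strict 5-fold CV with disjoint folds (illustrative only)
set.seed(8)
fold_id <- sample(rep(1:5, length.out = nrow(dataCV)))  # assign every row to one fold
cv.error.kfold <- sapply(1:5, function(k) {
fit <- lm(Perc_Change ~ sentPerc + Count_Negatives + Count_Neutrals,
data = dataCV[fold_id != k, ])
preds <- predict(fit, dataCV[fold_id == k, ])
mse(actual = dataCV$Perc_Change[fold_id == k] * 100, predicted = preds * 100)
})
mean(cv.error.kfold)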
Our best model is a linear regression analysis of our hourly data of Bitcoin Perc_Change on Twitter sentPerc, Count_Negatives and Count_Neutrals with the equation \(PercChange = \beta_0 + \beta_1SentimentPercent + \beta_2CountNegatives + \beta_3CountNeutrals + \varepsilon\).