Introduction

In this document, I analyze news headlines about cryptocurrencies (such as bitcoin). This is a trial run. I aim to accomplish three things:

  1. Create code so that John’s headlines can be incorporated immediately.

  2. The dictionary used to perform sentiment analysis matters, so I use the finance mapping. The lists of stop words and sentiment words can be found on Loughran and McDonald’s page.

  3. Aggregation of sentiment is important as well. I follow the best practices laid out in the survey article.

With the sentiment scores in hand, Ziemek will run an event study and we can write up the results.

Data Cleaning and Text analysis methodology

In this part, I only analyze the title of the 2017 bitcoin dataframe given to me by John. It is relatively easy to amend the code to analyze the article description. The steps that I follow are:

  1. First, I download the csv file 2017_bitcoin_dataframe.csv. The file contains 1796 news articles.

  2. I make a list of all the stopwords from Loughran and McDonald’s page. These stop words include short generic words such as “are” and “at”; longer generic words such as “allow” and “almost”; dates and numbers; and auditor names. I could have left out the auditor names, but I kept them. I do not use the “names” list, as there was no need.

  3. After I compile a master list of stopwords, I convert them to lower case.

  4. I also compile a list of sentiment words from their page. The sentiments are given in different worksheets. The possible sentiments are negative, positive, uncertainty, litigious, constraining, strong modal and weak modal. I do not plan to use the constraining, strong modal and weak modal sentiments.

  5. I convert all the master lists to lower case, and remove punctuation and numbers.
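
The list preparation in steps 3 to 5 can be sketched as follows. This is a minimal Python illustration, not the code used for this report: the function name clean_word_list and the sample entries are hypothetical, and dropping duplicates is my own assumption rather than something stated above.

```python
import string

def clean_word_list(words):
    """Lower-case each entry, strip punctuation and digits, drop empties and duplicates."""
    # Translation table that deletes ASCII punctuation and digits.
    drop = str.maketrans("", "", string.punctuation + string.digits)
    cleaned = []
    seen = set()
    for word in words:
        token = word.lower().translate(drop).strip()
        # Entries that were purely numeric become empty and are skipped.
        if token and token not in seen:
            seen.add(token)
            cleaned.append(token)
    return cleaned

# Hypothetical sample entries, not the actual Loughran and McDonald lists.
raw_stopwords = ["ARE", "At", "Allow", "ALMOST", "2017", "at"]
stopwords = clean_word_list(raw_stopwords)
print(stopwords)
```

The same helper could be applied to each sentiment worksheet so that the stopword and sentiment lists are normalized in the same way as the titles.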

At this point, after assembling the stopwords and sentiment words, I can analyze the title text. The methodology steps are:

  1. I tokenize the articles. So, for example, the first article title is “bitcoin prices” — the title has two words. By tokenizing, I split the title into two parts: “bitcoin” and “prices”. I do so for all the 1796 news articles.

  2. Then, I remove all the stopwords. Note that I am able to do so because punctuation is removed; whitespace is stripped; everything is lowercase. For example, the second article is titled:

“How many bitcoins we can miner per a day with S7 AntMiner? Bitcoin India Inc”.

This second article has 15 words. After removing the stop words, there are 5 germane words to analyze.

  3. Afterward, for each article, I match the remaining words against the positive and negative word lists. The second article turned out to have no positive or negative words, so I classify it as having a neutral sentiment.

  4. An article may have both positive and negative words. In that case, using the survey article approach, I calculate the percent positive metric: (number of positive words - number of negative words) / total number of cleaned words.

  5. Last, using the UTC time, I also calculate the hour, month and weekday of the publication.
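
The methodology steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's actual code: the word sets are tiny hypothetical stand-ins for the Loughran and McDonald lists, the function names are made up, the timestamp is assumed to be an ISO-8601 string in UTC, and scoring an empty title as 0.0 is my own choice.

```python
import re
from datetime import datetime

# Hypothetical miniature word lists, standing in for the real dictionaries.
STOPWORDS = {"how", "many", "we", "can", "per", "a", "with", "after"}
POSITIVE = {"gain", "strong"}
NEGATIVE = {"crash", "loss"}
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def tokenize(title):
    """Lower-case the title, remove punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", title.lower()).split()

def score_title(title):
    """Return the cleaned tokens and (positive - negative) / cleaned word count."""
    tokens = [t for t in tokenize(title) if t not in STOPWORDS]
    n_pos = sum(t in POSITIVE for t in tokens)
    n_neg = sum(t in NEGATIVE for t in tokens)
    # Assumption: a title with no cleaned words is scored as neutral (0.0).
    score = (n_pos - n_neg) / len(tokens) if tokens else 0.0
    return tokens, score

def time_features(utc_timestamp):
    """Hour, month and weekday name from an ISO-8601 timestamp assumed to be UTC."""
    dt = datetime.fromisoformat(utc_timestamp)
    return dt.hour, dt.month, WEEKDAYS[dt.weekday()]

print(score_title("bitcoin prices"))        # no sentiment words, so neutral
print(score_title("Bitcoin prices crash"))  # hypothetical title, one negative word
print(time_features("2017-02-03 14:30:00"))
```

A title with no matches in either list gets a score of exactly zero, which is how most of the titles end up neutral.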

Analysis

Overall Exploratory Analysis

Table 1: Descriptive statistics of the 2017 bitcoin dataframe
Statistic            N      Mean  St. Dev.  Min  Pctl(25)  Pctl(75)  Max
# cleaned words      1,796  5.4   2.3       0    4         7         23
# Neg words          1,796  0.1   0.4       0    0         0         4
# Pos words          1,796  0.1   0.2       0    0         0         2
# Uncertain words    1,796  0.02  0.2       0    0         0         2
# Litigious words    1,796  0.04  0.2       0    0         0         3

Table 1 shows the summary statistics. On average, after the clean-up, each article title had 5.4 words. An overwhelming majority of the titles were neutral, as evidenced by the low numbers of negative and positive words. This finding is consistent with John, who also found that most articles were neutral. It is interesting that there are more litigious words than uncertain words.

Figure 1 shows box plots of the percent positive metric grouped by day of the week. Two observations are in order. First, the mean of the negative titles does not appear to change by day. Second, positive words may follow a pattern: most positive articles seem to come out on Friday and Sunday.

Figure 2 shows box plots of the percent positive metric by month. The lack of a box means that the majority of the articles are neutral. For some reason, however, the variability seems to increase as the year progresses. Also, something probably happened in February, when the majority of articles are negative. Note that this covers only 2017 and may change after I add 2018 and other articles.
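
To illustrate why a missing box signals mostly neutral titles, here is a small Python sketch that computes the quartiles a box plot would draw for each group. The (month, score) pairs are invented for illustration and are not taken from the dataset, and box_by_group is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (month, score) pairs; not values from the 2017 dataframe.
records = [
    (1, 0.0), (1, 0.0), (1, 0.0), (1, 0.0), (1, 0.25),
    (2, -0.5), (2, -0.25), (2, -0.2), (2, 0.0), (2, 0.2),
]

def box_by_group(records):
    """Return {group: (q1, median, q3)} for (group, score) pairs."""
    groups = defaultdict(list)
    for key, score in records:
        groups[key].append(score)
    boxes = {}
    for key, scores in groups.items():
        q1, median, q3 = quantiles(scores, n=4, method="inclusive")
        boxes[key] = (q1, median, q3)
    return boxes

boxes = box_by_group(records)
for month, (q1, median, q3) in sorted(boxes.items()):
    collapsed = q1 == q3  # the box has zero height when both quartiles coincide
    print(month, q1, median, q3, "no visible box" if collapsed else "box")
```

In this toy data, month 1 has both quartiles at zero, so its box collapses to a line, while month 2 has a negative median, which is the kind of pattern a mostly negative month would produce.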

Bias of Authors

Overall, the data contain 119 authors who wrote more than two articles. Figure 3 shows the authors whose bias tends to be positive; there were 17 such authors. For contrast, I depict in blue the authors who have never written a negative headline. The level of bias is clear.

Figure 4 shows the negatively biased authors. Overall, there were 48 negatively biased authors. For contrast, I depict in blue the authors who have never written a positive headline. Again, the level of bias is clear.

All the other authors appear unbiased by this measure: they do not seem to have used any positive or negative words.

In the analysis, we need to add author fixed effects or bias fixed effects.
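
One way to turn the author grouping into a bias label that could feed a fixed effect is sketched below. The rule used here is an assumption on my part, since the exact definition of bias is not spelled out above: an author with more than two articles is labelled by the sign of the mean headline score. The function name, author names and scores are all hypothetical.

```python
from collections import defaultdict

def classify_authors(rows, min_articles=3):
    """Label authors with at least min_articles headlines by the sign of their mean score.

    rows is an iterable of (author, score) pairs.
    Assumption: bias is the sign of the mean score; the report may define it differently.
    """
    by_author = defaultdict(list)
    for author, score in rows:
        by_author[author].append(score)
    labels = {}
    for author, scores in by_author.items():
        if len(scores) < min_articles:
            continue  # keep only authors with more than two articles
        mean = sum(scores) / len(scores)
        if mean > 0:
            labels[author] = "positive"
        elif mean < 0:
            labels[author] = "negative"
        else:
            labels[author] = "neutral"
    return labels

# Hypothetical (author, score) rows, for illustration only.
rows = [
    ("author_a", 0.2), ("author_a", 0.0), ("author_a", 0.1),
    ("author_b", -0.5), ("author_b", 0.0), ("author_b", 0.0),
    ("author_c", 0.0), ("author_c", 0.0), ("author_c", 0.0),
    ("author_d", 0.3),
]
labels = classify_authors(rows)
print(labels)
```

The resulting label per author could then be merged back onto the article rows and used as a categorical variable in the event study.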