Introduction
In this document, I analyze news headlines containing crypto terms (like bitcoin). This is a trial run. I look to accomplish three things:
Create code so that John's headlines can be incorporated immediately
The dictionary used to perform sentiment analysis matters; hence, I use the finance-specific mapping. The lists of stop words and sentiment words can be found on Loughran and McDonald's page.
Aggregation of sentiment is important as well. I follow the best practices laid out in the survey article.
With the sentiment scores, Ziemek will run an event study, and we can write up the results.
Data Cleaning and Text Analysis Methodology
In this part, I analyze only the titles in the 2017 bitcoin dataframe given to me by John. It is relatively easy to amend the code to analyze the article descriptions as well. The steps that I follow are:
First, I download the CSV file 2017_bitcoin_dataframe.csv. The file contains 1,796 news articles.
I make a list of all the stop words from Loughran and McDonald's page. These stop words include generic words such as "are" and "at"; longer generic words like "allow" and "almost"; dates and numbers; and auditor names. I could have chosen to exclude the auditor names, but I kept them. I do not use the names list, as there was no need.
After compiling the master list of stop words, I convert it to lower case.
I also compile the sentiment word lists from their page. The sentiment categories are given in separate worksheets. The possible categories are negative, positive, uncertainty, litigious, constraining, strong modal, and weak modal. I do not plan to use the constraining, strong modal, and weak modal categories.
I convert all the master lists to lower case, remove punctuation, and remove numbers. (A sketch of these preparation steps appears right after this list.)
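A minimal Python sketch of the preparation steps above. The CSV file name comes from the text; the Loughran-McDonald file names, worksheet names, and column positions are illustrative assumptions.

```python
import pandas as pd

# Load the 2017 headlines (file name from the text above).
articles = pd.read_csv("2017_bitcoin_dataframe.csv")   # expected: 1,796 rows

# Stop word lists from Loughran and McDonald's page; the plain-text file
# names below are assumptions (generic, generic-long, dates/numbers, auditor).
stopword_files = [
    "StopWords_Generic.txt",
    "StopWords_GenericLong.txt",
    "StopWords_DatesandNumbers.txt",
    "StopWords_Auditor.txt",   # kept deliberately; the names list is skipped
]
stopwords = set()
for path in stopword_files:
    with open(path) as fh:
        stopwords.update(line.strip().lower() for line in fh if line.strip())

# Sentiment word lists, one worksheet per category (file/sheet names assumed).
sheets = pd.read_excel("LoughranMcDonald_SentimentWordLists.xlsx", sheet_name=None)
sentiments = {
    name.lower(): set(df.iloc[:, 0].astype(str).str.lower())
    for name, df in sheets.items()
    if name.lower() in {"negative", "positive", "uncertainty", "litigious"}
}
```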
At this point, after assembling the stopwords and sentiment words, I can analyze the title text. The methodology steps are:
I tokenize the titles. For example, the first article title is "bitcoin prices", which has two words. Tokenizing splits the title into two tokens: "bitcoin" and "prices". I do so for all 1,796 news articles.
Then, I remove all the stop words. Note that I am able to do so because punctuation has been removed, whitespace has been stripped, and everything is lowercase. For example, the second article is titled:
“How many bitcoins we can miner per a day with S7 AntMiner? Bitcoin India Inc”.
This second article has 15 words. After removing the stop words, there are 5 germane words to analyze.
Afterward, for each article, I match the remaining words against the positive and negative word lists. It turned out that the second article did not have any positive or negative words, so I classify that article as having neutral sentiment.
It may be that an article has both positive and negative words. In that case, following the survey article's approach, I calculate the percent positive metric: (number of positive words - number of negative words) / number of total cleaned words.
Last, using the UTC time, I also calculate the hour, month, and weekday of publication. (These steps are sketched in code after this list.)
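The tokenization, stop-word removal, sentiment matching, and time-feature steps can be sketched as follows, continuing the Python sketch above. The `title` and `published_utc` column names are assumptions about the data frame layout.

```python
import re
import pandas as pd

def clean_tokens(title):
    """Lowercase, drop punctuation and digits, split into words, remove stop words."""
    text = re.sub(r"[^a-z\s]", " ", str(title).lower())
    return [w for w in text.split() if w not in stopwords]

def score_title(title):
    tokens = clean_tokens(title)
    n_pos = sum(w in sentiments["positive"] for w in tokens)
    n_neg = sum(w in sentiments["negative"] for w in tokens)
    return pd.Series({
        "n_cleaned": len(tokens),
        "n_pos": n_pos,
        "n_neg": n_neg,
        "n_uncertain": sum(w in sentiments["uncertainty"] for w in tokens),
        "n_litigious": sum(w in sentiments["litigious"] for w in tokens),
        # Percent positive metric from the survey article:
        # (positive - negative) / total cleaned words; 0 when nothing remains.
        "pct_positive": (n_pos - n_neg) / len(tokens) if tokens else 0.0,
    })

articles = pd.concat([articles, articles["title"].apply(score_title)], axis=1)

# Hour, month, and weekday of publication from the UTC timestamp
# (the timestamp column name is an assumption).
published = pd.to_datetime(articles["published_utc"], utc=True)
articles["hour"] = published.dt.hour
articles["month"] = published.dt.month
articles["weekday"] = published.dt.day_name()
```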
Analysis
Overall Exploratory Analysis
Table 1: Summary statistics of the cleaned article titles.

| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| # cleaned words | 1,796 | 5.4 | 2.3 | 0 | 4 | 7 | 23 |
| # negative words | 1,796 | 0.1 | 0.4 | 0 | 0 | 0 | 4 |
| # positive words | 1,796 | 0.1 | 0.2 | 0 | 0 | 0 | 2 |
| # uncertain words | 1,796 | 0.02 | 0.2 | 0 | 0 | 0 | 2 |
| # litigious words | 1,796 | 0.04 | 0.2 | 0 | 0 | 0 | 3 |
Table 1 shows the summary statistics. On average, after the cleanup, each article had 5.4 words; the overwhelming majority of articles were neutral, as evidenced by the low counts of negative and positive words. It is interesting that there are more litigious words than uncertain words. This finding is consistent with John's results, as he also found that most articles were neutral.
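For reference, the counts behind Table 1 can be reproduced directly from the scored data frame; a minimal sketch, assuming the column names introduced above:

```python
# Reproduce the Table 1 layout: count, mean, st. dev., min, 25th/75th pctl, max.
count_cols = ["n_cleaned", "n_neg", "n_pos", "n_uncertain", "n_litigious"]
summary = articles[count_cols].describe(percentiles=[0.25, 0.75]).T
print(summary[["count", "mean", "std", "min", "25%", "75%", "max"]].round(2))
```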
Figure 1 shows the box plot of the percent positive words grouped by day of the week. Two observations are in order. First, the mean of negative titles does not appear to change by day. Second, positive words may have a pattern: most positive articles seem to come out on Friday and Sunday.
Figure 2 shows the box plot of the percent positive words by month. The lack of a box means that the majority of articles are neutral. But, for some reason, the variability seems to increase with the month. Also, something probably happened in February, when the majority of articles are negative. Note that this covers only 2017; the pattern may change after I add 2018 and later articles.
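Figures 1 and 2 can be drawn from the same data frame; a minimal matplotlib sketch, assuming the `pct_positive`, `weekday`, and `month` columns computed earlier:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Figure 1: percent positive words by weekday.
articles.boxplot(column="pct_positive", by="weekday", ax=axes[0])
axes[0].set_title("Percent positive words by weekday")

# Figure 2: percent positive words by month.
articles.boxplot(column="pct_positive", by="month", ax=axes[1])
axes[1].set_title("Percent positive words by month")

fig.suptitle("")          # drop pandas' automatic grouping title
plt.tight_layout()
plt.show()
```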