This document contains some initial EDA for the merged returns and article dataset. The outline is as follows:
We have some duplicated IDs which are the same article, same company, and same date, but with different outcomes. Filter No.1 removes these and we don’t lose much data.
However, a significant proportion of our data has multiple articles about the same company on the same day. This means the outcome cannot be attributed to any single article (i.e. the treatment). How do we want to address this? I could see a world where we don’t care about this in the training process, though in that setting it is also impossible to draw a link of the form “this article leads to this over-reaction”.
Filter No.2 keeps one of these articles, sampled at random, and is used throughout.
Filter No.3 removes these rows entirely.
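A minimal sketch of the three filters described above, in plain Python. The row layout, the toy data, the function names, and the choice to run Filter No.1 before the other two are all assumptions for illustration; in particular, it assumes Filter No.1 drops every copy of a conflicting ID rather than keeping one.

```python
import random
from collections import defaultdict

# Hypothetical toy rows: (article_id, ticker, date, over_reaction)
rows = [
    (1, "MSFT", "2016-08-25", True),
    (1, "MSFT", "2016-08-25", False),  # same ID, conflicting outcome
    (2, "GM", "2016-08-25", True),
    (3, "GM", "2016-08-25", False),    # second article, same ticker and day
    (4, "WFC", "2016-08-26", False),
]

def filter_1(rows):
    """Drop every ID that appears with more than one distinct outcome."""
    outcomes = defaultdict(set)
    for article_id, _, _, outcome in rows:
        outcomes[article_id].add(outcome)
    return [r for r in rows if len(outcomes[r[0]]) == 1]

def group_by_ticker_day(rows):
    """Collect rows that share the same (ticker, date)."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r[1], r[2])].append(r)
    return groups

def filter_2(rows, seed=0):
    """Keep one randomly chosen article per (ticker, date)."""
    rng = random.Random(seed)
    return [rng.choice(g) for g in group_by_ticker_day(rows).values()]

def filter_3(rows):
    """Drop every (ticker, date) that has more than one article."""
    return [g[0] for g in group_by_ticker_day(rows).values() if len(g) == 1]

clean = filter_1(rows)      # conflicting duplicate IDs removed
sampled = filter_2(clean)   # one article per ticker-day
strict = filter_3(clean)    # only ticker-days with a single article
```

Fixing the seed keeps the Filter No.2 sample reproducible across runs, which matters if it is used throughout the analysis.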
Interestingly, a significant proportion of our data is not an over-reaction after 3 days but is one after 7 (and the other way around). Again, making the link between article and outcome is difficult over 7 days (even more so over longer horizons), since we don’t observe what other shocks occur in between. We may not care about this either, though again it makes it difficult for the model to learn the article -> outcome link if noise outside the observables given to the model (i.e. the article) drives the outcome. This is even more the case when comparing 3-day and 90-day drift.
Important caveat: I don’t know much about drift! This may well all be a moot point, just thought I’d bring it up.
I stick with the 3-day drift as the outcome unless otherwise stated.
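One way to quantify how often the label flips between horizons is a simple cross-tabulation of the 3-day and 7-day outcomes. This is only a sketch: the label pairs below are made-up toy values, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy labels: (over_reaction_3d, over_reaction_7d) per article
labels = [
    (True, True),
    (False, True),
    (True, False),
    (False, False),
    (False, True),
]

# 2x2 cross-tab: how many articles fall in each (3-day, 7-day) cell
crosstab = Counter(labels)

# Share of articles whose label differs between the two horizons
disagree = sum(n for (d3, d7), n in crosstab.items() if d3 != d7)
share_disagree = disagree / len(labels)

for (d3, d7), n in sorted(crosstab.items()):
    print(f"3-day={d3!s:5} 7-day={d7!s:5} n={n}")
print(f"share disagreeing: {share_disagree:.2f}")
```

The same tabulation would apply unchanged to the 3-day versus 90-day comparison.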
No big surprises here, which is great!

Simple outcome summary table (for 3-day and 90-day returns)

| Over-reaction | N unique tickers | N unique providers | N unique ids | N news vs. opinion | Outcome horizon |
|---|---|---|---|---|---|
| FALSE | 735 | 602 | 49258 | 21117 | 90 Day drift |
| TRUE | 734 | 617 | 48266 | 20586 | 90 Day drift |
| FALSE | 734 | 604 | 48796 | 20970 | 3 Day drift |
| TRUE | 737 | 617 | 48728 | 20733 | 3 Day drift |

Simple outcome summary table of news categories (for 3-day returns)

| Category | Mean over-reaction |
|---|---|
| news | 0.4971585 |
| opinion | 0.5015138 |

Simple outcome summary table of news categories (for 3-day returns)

| Category | Mean over-reaction |
|---|---|
| news | 0.4936336 |
| opinion | 0.4958707 |
There is more dispersion in our outcome variable early in our sample (years 2008-2011), which corresponds to the years with far fewer observations. No big differences between outcomes.
We have a few very heavily represented companies. I’m not sure if we care about this potentially biasing our results?
We have a few very heavily represented publishers (see table). I’m not sure if we care about this potentially biasing our results?

Counting number of publications (for 3-day returns)

| Publisher | N articles | Mean outcome |
|---|---|---|
| Zacks Investment Research | 40411 | 0.5000371 |
| Reuters | 23696 | 0.4947670 |
| Investing.com | 7841 | 0.5041449 |
| Seeking Alpha | 6916 | 0.4982649 |
| Bloomberg | 1253 | 0.4836393 |
| Nicholas Santiago | 748 | 0.5294118 |
| Gregory W. Harmon | 697 | 0.5236729 |
| Ryan Mallory | 621 | 0.4830918 |
| International Business Times | 553 | 0.5189873 |
| Estimize | 498 | 0.5160643 |
| Harry Boxer | 464 | 0.5193966 |
| The Motley Fool | 429 | 0.4965035 |
| Dr. Duru | 334 | 0.4550898 |
| Benzinga | 315 | 0.5174603 |
| Tim Knight | 301 | 0.5614618 |
| Edison | 269 | 0.4721190 |
| Cointelegraph | 268 | 0.5261194 |
| Dividend Yield | 213 | 0.4976526 |
| iFOREX | 195 | 0.4615385 |
| Cryptovest | 186 | 0.4677419 |
| Trade The News | 181 | 0.5580110 |
| Binary Options Strategy | 180 | 0.4944444 |
| Marvin Clark | 175 | 0.4571429 |
| MetalMiner | 167 | 0.4910180 |
| The Night Owl Trader | 161 | 0.5031056 |
| David Trainer | 159 | 0.5471698 |
| LFB Forex | 157 | 0.5159236 |
| Charles Sizemore | 153 | 0.5098039 |
| The Gold Report | 149 | 0.5436242 |
| Frank Holmes | 142 | 0.5070423 |
| Business Insider | 141 | 0.4397163 |
| Gareth Soloway | 139 | 0.4028777 |
| Haris Anwar/Investing.com | 137 | 0.5693431 |
| CNBC | 135 | 0.5185185 |
| Warrior Trading | 128 | 0.4531250 |
| IFC Markets | 123 | 0.4308943 |
| Pinchas Cohen/Investing.com | 112 | 0.4375000 |
| Danske Markets | 107 | 0.4672897 |
| Brian Gilmartin | 106 | 0.5471698 |
| Wall Street Daily | 103 | 0.5339806 |
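As a rough check on whether the per-publisher means in the table differ from 0.5 by more than sampling noise, the sketch below converts a few rows into binomial z-scores. It assumes articles are independent draws with a baseline over-reaction rate of 0.5, which is a simplification (articles about the same ticker and day are not independent); the function name and the choice of rows are for illustration only.

```python
import math

# (publisher, n_articles, mean_outcome) copied from the publisher table above
publishers = [
    ("Zacks Investment Research", 40411, 0.5000371),
    ("Reuters", 23696, 0.4947670),
    ("Tim Knight", 301, 0.5614618),
    ("Gareth Soloway", 139, 0.4028777),
]

def z_score(n, mean, p0=0.5):
    """Distance of the observed mean from p0, in binomial standard errors."""
    se = math.sqrt(p0 * (1 - p0) / n)  # assumes independent articles
    return (mean - p0) / se

scores = {name: z_score(n, mean) for name, n, mean in publishers}

for name, z in scores.items():
    print(f"{name:28s} z = {z:+.2f}")
```

With roughly 40 publishers in the table, a handful of values beyond two standard errors would be expected by chance alone, so the small-sample extremes are not necessarily evidence of a publisher effect.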
In this section I do a few things:
Point 3 deserves some explanation:

Sampling 10 headlines (for max 3-day and 90-day returns)

| ID | Ticker | Drift | Headline | Outcome |
|---|---|---|---|---|
| 362282 | GOOGL | 19.11131 | Vol Spikes On Analyst Takeover Talk Is Samsung Scared | 3 days |
| 411227 | COP | 19.11131 | Top Off The Portfolio With Phillips 66 | 3 days |
| 441081 | WFC | 19.11131 | Stocks And ETFs Start Strong Skid Into The Close | 3 days |
| 385789 | FDX | 19.11131 | UPS tops expectations in holiday quarter | 3 days |
| 301974 | MSFT | 19.11131 | Here s What The Buy Side Expects from Microsoft On Tuesday | 3 days |
| 424576 | GM | 123.56968 | Hyundai Motor profit slumps warns China U S sales malaise to persist | 90 days |
| 410095 | COP | 123.56968 | How Australia East Timor treaty unlocks 65 billion gas fields | 90 days |
| 439716 | WFC | 123.56968 | EUR USD Euro s False Start | 90 days |
| 238542 | FCX | 123.56968 | iFOREX Daily Analysis August 25 2016 | 90 days |
| 301978 | MSFT | 123.56968 | Is Old Tech MSFT Topping | 90 days |

Sampling 10 headlines (for min 3-day and 90-day returns)

| ID | Ticker | Drift | Headline | Outcome |
|---|---|---|---|---|
| 312975 | GE | -18.59798 | Exclusive GE seeking to shed troubled insurance business sources | 3 days |
| 353282 | CMI | -18.59798 | Arkansas holds fourth execution in eight days | 3 days |
| 377213 | UNP | -18.59798 | Buffett Indicator Soars | 3 days |
| 395250 | EMR | -18.59798 | Biden Sanders Defeat Trump in Poll Campaign Update | 3 days |
| 425553 | MRK | -18.59798 | Juno Therapeutics Makes Revolutionary Breakthrough In Cancer Treatment | 3 days |
| 425680 | GM | -123.06453 | Exclusive Fiat Chrysler delays many future vehicle programs | 90 days |
| 243988 | CMTL | -123.06453 | Ericsson Extends Collaboration With NBN Co In Australia | 90 days |
| 236538 | VAR | -123.06453 | Varian s ProBeam Now Available In Multi Room Configuration | 90 days |
| 256352 | EPAC | -123.06453 | Why Earnings Season Could Be Great For Actuant ATU | 90 days |
| 431669 | MS | -123.06453 | Monsanto cuts first quarter earnings guidance | 90 days |