Some very exploratory EDA for returns & news-article data

This document contains some initial EDA for the merged returns and article data-set. The outline is as follows:

  1. Basic data-set filters
  2. Checking distributions of drift
  3. Transition matrix between 3-7 day over-reactions
  4. Outcome comparison across basic counts
  5. Drift vs. Article type
  6. Distribution of publishing dates
  7. Distribution of tickers
  8. Distribution of publishers

Basic data filters:

We have some duplicated IDs: the same article, same company, and same date, but with different outcomes. Filter No.1 removes these, and we don’t lose much data.

However, a significant proportion of our data also has multiple articles about the same company on the same day. This makes it impossible to attribute the outcome variable to any single treatment (i.e. a particular article being written). How do we want to address this? I could see a world where we don’t care about this in the training process, though in that setting it’s also impossible to draw a link of the form “this article leads to this over-reaction”.

Filter No.2 samples one of these same-day articles at random for each company and day, and is used throughout.

Filter No.3 removes these rows altogether.
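
A minimal sketch of the three filters in plain Python, to make the differences concrete. The field names (`id`, `ticker`, `date`, `over_reaction`) are hypothetical, and I assume here that Filter No.1 drops every ID that appears more than once; the actual filters may differ in detail.

```python
import random
from collections import Counter, defaultdict


def filter_duplicate_ids(rows):
    """Filter No.1 (assumed behaviour): drop every ID that occurs more than once."""
    id_counts = Counter(r["id"] for r in rows)
    return [r for r in rows if id_counts[r["id"]] == 1]


def group_by_company_day(rows):
    """Group rows by (ticker, date) so same-day articles sit together."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["ticker"], r["date"])].append(r)
    return groups


def filter_sample_one(rows, seed=0):
    """Filter No.2: keep one randomly chosen article per company and day."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [rng.choice(group) for group in group_by_company_day(rows).values()]


def filter_drop_multi(rows):
    """Filter No.3: keep only company-days that have exactly one article."""
    return [g[0] for g in group_by_company_day(rows).values() if len(g) == 1]


# Toy rows with made-up values, only to show the shape of the input.
rows = [
    {"id": 1, "ticker": "AAA", "date": "2016-01-04", "over_reaction": True},
    {"id": 1, "ticker": "AAA", "date": "2016-01-04", "over_reaction": False},
    {"id": 2, "ticker": "AAA", "date": "2016-01-05", "over_reaction": True},
    {"id": 3, "ticker": "AAA", "date": "2016-01-05", "over_reaction": False},
    {"id": 4, "ticker": "BBB", "date": "2016-01-05", "over_reaction": True},
]
deduped = filter_duplicate_ids(rows)
sampled = filter_sample_one(deduped)
strict = filter_drop_multi(deduped)
```

Filter No.2 keeps more data than Filter No.3, at the cost of pairing an outcome with an article that may not be the one that drove it.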

Distribution of outcome variables

Transition matrix of 3-7 day drift outcome

Interestingly, a significant proportion of our data is not an over-reaction after 3 days but is one after 7 (and the other way around). Again, making the link between article and outcome is difficult over 7 days (even more so over longer horizons), since we don’t observe what other shocks occur in between. We may also not care about this, though again it makes it difficult for the model to learn the article -> outcome link if noise outside the observables given to the model (i.e. the article) is driving the outcome. This is even more of an issue when comparing 3 and 90 day drift.

Important caveat: I don’t know much about drift! This may well all be a moot point, just thought I’d bring it up.

I stick with the 3-day drift as the outcome unless otherwise stated.
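
For reference, a rough sketch of how a 3-day to 7-day transition matrix of this kind can be computed. It assumes two aligned lists of boolean over-reaction flags per article (hypothetical inputs) and row-normalises the counts, so each row gives the share of articles moving from a 3-day state to each 7-day state.

```python
from collections import Counter


def transition_matrix(flags_3d, flags_7d):
    """Row-normalised transition shares from 3-day to 7-day over-reaction flags."""
    counts = Counter(zip(flags_3d, flags_7d))
    matrix = {}
    for state_3d in (False, True):
        row_total = counts[(state_3d, False)] + counts[(state_3d, True)]
        matrix[state_3d] = {
            state_7d: (counts[(state_3d, state_7d)] / row_total if row_total else 0.0)
            for state_7d in (False, True)
        }
    return matrix


# Toy flags with made-up values, not taken from the data.
flags_3d = [True, True, False, False, False, True]
flags_7d = [True, False, False, True, False, True]
matrix = transition_matrix(flags_3d, flags_7d)
```

The off-diagonal entries (`matrix[False][True]` and `matrix[True][False]`) are the shares of articles that switch state between the two horizons.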

Checking basic counts between outcomes

No big surprises here, which is great!

Simple outcome summary table
For 3 and 90-day returns

| Over-reaction | N unique tickers | N unique providers | N unique ids | N news vs. opinion | Outcome horizon |
|---|---|---|---|---|---|
| FALSE | 735 | 602 | 49258 | 21117 | 90 Day drift |
| TRUE | 734 | 617 | 48266 | 20586 | 90 Day drift |
| FALSE | 734 | 604 | 48796 | 20970 | 3 Day drift |
| TRUE | 737 | 617 | 48728 | 20733 | 3 Day drift |

Drift distribution between article type

Simple outcome summary table of news categories
For 3-day returns

| Category | Mean over-reaction |
|---|---|
| news | 0.4971585 |
| opinion | 0.5015138 |

Simple outcome summary table of news categories
For 3-day returns

| Category | Mean over-reaction |
|---|---|
| news | 0.4936336 |
| opinion | 0.4958707 |

Distribution of publishing dates:

There is more dispersion in our outcome variable early on in our sample (years 2008 - 2011), which corresponds to the years with far fewer observations. No big differences between outcomes.

Distribution of companies:

We have a few very heavily represented companies. I’m not sure if we care about this potentially biasing our results?

Distribution of publishers:

We have a few very heavily represented publishers (see table). I’m not sure if we care about this potentially biasing our results?

Counting number of publications
For 3-day returns

| Publisher | N articles | Mean outcome |
|---|---|---|
| Zacks Investment Research | 40411 | 0.5000371 |
| Reuters | 23696 | 0.4947670 |
| Investing.com | 7841 | 0.5041449 |
| Seeking Alpha | 6916 | 0.4982649 |
| Bloomberg | 1253 | 0.4836393 |
| Nicholas Santiago | 748 | 0.5294118 |
| Gregory W. Harmon | 697 | 0.5236729 |
| Ryan Mallory | 621 | 0.4830918 |
| International Business Times | 553 | 0.5189873 |
| Estimize | 498 | 0.5160643 |
| Harry Boxer | 464 | 0.5193966 |
| The Motley Fool | 429 | 0.4965035 |
| Dr. Duru | 334 | 0.4550898 |
| Benzinga | 315 | 0.5174603 |
| Tim Knight | 301 | 0.5614618 |
| Edison | 269 | 0.4721190 |
| Cointelegraph | 268 | 0.5261194 |
| Dividend Yield | 213 | 0.4976526 |
| iFOREX | 195 | 0.4615385 |
| Cryptovest | 186 | 0.4677419 |
| Trade The News | 181 | 0.5580110 |
| Binary Options Strategy | 180 | 0.4944444 |
| Marvin Clark | 175 | 0.4571429 |
| MetalMiner | 167 | 0.4910180 |
| The Night Owl Trader | 161 | 0.5031056 |
| David Trainer | 159 | 0.5471698 |
| LFB Forex | 157 | 0.5159236 |
| Charles Sizemore | 153 | 0.5098039 |
| The Gold Report | 149 | 0.5436242 |
| Frank Holmes | 142 | 0.5070423 |
| Business Insider | 141 | 0.4397163 |
| Gareth Soloway | 139 | 0.4028777 |
| Haris Anwar/Investing.com | 137 | 0.5693431 |
| CNBC | 135 | 0.5185185 |
| Warrior Trading | 128 | 0.4531250 |
| IFC Markets | 123 | 0.4308943 |
| Pinchas Cohen/Investing.com | 112 | 0.4375000 |
| Danske Markets | 107 | 0.4672897 |
| Brian Gilmartin | 106 | 0.5471698 |
| Wall Street Daily | 103 | 0.5339806 |
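
As a rough illustration of how a publisher table like this can be built, and how concentrated the sample is, here is a small sketch. The field names (`publisher`, `over_reaction`) are hypothetical, and the added share column is my own illustration rather than something in the table.

```python
from collections import defaultdict


def publisher_summary(rows):
    """Per publisher: (name, n_articles, share_of_sample, mean_outcome), largest first."""
    totals = defaultdict(lambda: [0, 0])  # publisher -> [n_articles, n_over_reactions]
    for r in rows:
        entry = totals[r["publisher"]]
        entry[0] += 1
        entry[1] += int(r["over_reaction"])
    n_rows = len(rows)
    summary = [
        (publisher, n, n / n_rows, hits / n)
        for publisher, (n, hits) in totals.items()
    ]
    summary.sort(key=lambda item: item[1], reverse=True)
    return summary


# Toy rows with made-up values, only to show the shape of the input.
rows = [
    {"publisher": "A", "over_reaction": True},
    {"publisher": "A", "over_reaction": False},
    {"publisher": "A", "over_reaction": True},
    {"publisher": "B", "over_reaction": False},
]
summary = publisher_summary(rows)
```

Reading the share column alongside the mean outcome makes it easier to see that the means for small publishers rest on few articles.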

Looking at average number of days between articles

In this section I do a few things:

  1. Plotting the distribution of the average N-days between article publications by companies
  2. Distribution of N-articles by companies
  3. Plotting a composite measure

Point 3 deserves some explanation:

  • I want to check for the “balance” of our sample of articles
  • We have two metrics: (1) avr. period between articles and (2) total number of articles
  • A company could be “over”-represented if it has articles published every day (an average gap far below the mean duration) or has tons of articles written about it
  • For #3 I divide (# Articles) / (Avr. duration for company). I then rescale the result to the range 0 - 1
  • As you can see, by this metric our sample is weighted super heavily towards a few top tech companies (check out the last plot)
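
The composite measure described above can be sketched as follows. This assumes a hypothetical mapping from ticker to a list of publication dates, uses min-max scaling for the 0 - 1 range, and skips companies with fewer than two articles or a zero average gap; those last two choices are my own assumptions, since the notes do not say how such cases are handled.

```python
from datetime import date


def composite_scores(article_dates):
    """(# articles) / (average days between articles), min-max scaled to [0, 1]."""
    raw = {}
    for ticker, dates in article_dates.items():
        ordered = sorted(dates)
        if len(ordered) < 2:
            continue  # assumption: no gap is defined for a single article
        gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
        avg_gap = sum(gaps) / len(gaps)
        if avg_gap == 0:
            continue  # assumption: skip companies whose articles all share one day
        raw[ticker] = len(ordered) / avg_gap
    if not raw:
        return {}
    lowest, highest = min(raw.values()), max(raw.values())
    span = highest - lowest
    return {t: ((v - lowest) / span if span else 0.0) for t, v in raw.items()}


# Toy dates with made-up values, not taken from the data.
article_dates = {
    "AAA": [date(2016, 1, 1), date(2016, 1, 2), date(2016, 1, 3)],   # 3 articles, 1-day gaps
    "BBB": [date(2016, 1, 1), date(2016, 1, 11)],                    # 2 articles, 10-day gap
    "CCC": [date(2016, 1, 1), date(2016, 1, 3), date(2016, 1, 5)],   # 3 articles, 2-day gaps
}
scores = composite_scores(article_dates)
```

A company with many articles at short intervals scores close to 1, while sparsely covered companies score close to 0.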

Distribution of avr. duration:

Number of articles per company:

Weighted average N-articles:

Industry distribution

Sampling some top/bottom drift titles:

Top 10 randomly sampled headlines:

Sampling 10 headlines
For max 3-day and 90-day returns

| ID | Ticker | Drift | Headline | Outcome |
|---|---|---|---|---|
| 362282 | GOOGL | 19.11131 | Vol Spikes On Analyst Takeover Talk Is Samsung Scared | 3 days |
| 411227 | COP | 19.11131 | Top Off The Portfolio With Phillips 66 | 3 days |
| 441081 | WFC | 19.11131 | Stocks And ETFs Start Strong Skid Into The Close | 3 days |
| 385789 | FDX | 19.11131 | UPS tops expectations in holiday quarter | 3 days |
| 301974 | MSFT | 19.11131 | Here s What The Buy Side Expects from Microsoft On Tuesday | 3 days |
| 424576 | GM | 123.56968 | Hyundai Motor profit slumps warns China U S sales malaise to persist | 90 days |
| 410095 | COP | 123.56968 | How Australia East Timor treaty unlocks 65 billion gas fields | 90 days |
| 439716 | WFC | 123.56968 | EUR USD Euro s False Start | 90 days |
| 238542 | FCX | 123.56968 | iFOREX Daily Analysis August 25 2016 | 90 days |
| 301978 | MSFT | 123.56968 | Is Old Tech MSFT Topping | 90 days |

Bottom 10 randomly sampled headlines:

Sampling 10 headlines
For min 3-day and 90-day returns

| ID | Ticker | Drift | Headline | Outcome |
|---|---|---|---|---|
| 312975 | GE | -18.59798 | Exclusive GE seeking to shed troubled insurance business sources | 3 days |
| 353282 | CMI | -18.59798 | Arkansas holds fourth execution in eight days | 3 days |
| 377213 | UNP | -18.59798 | Buffett Indicator Soars | 3 days |
| 395250 | EMR | -18.59798 | Biden Sanders Defeat Trump in Poll Campaign Update | 3 days |
| 425553 | MRK | -18.59798 | Juno Therapeutics Makes Revolutionary Breakthrough In Cancer Treatment | 3 days |
| 425680 | GM | -123.06453 | Exclusive Fiat Chrysler delays many future vehicle programs | 90 days |
| 243988 | CMTL | -123.06453 | Ericsson Extends Collaboration With NBN Co In Australia | 90 days |
| 236538 | VAR | -123.06453 | Varian s ProBeam Now Available In Multi Room Configuration | 90 days |
| 256352 | EPAC | -123.06453 | Why Earnings Season Could Be Great For Actuant ATU | 90 days |
| 431669 | MS | -123.06453 | Monsanto cuts first quarter earnings guidance | 90 days |