MDMWG
7/18/2018
We're about 60% (58.75%) through with:
Many quantities range over several orders of magnitude and follow a power-law distribution.
For unstructured data
(This is after removing an outlier… about which more later.)
For structured data
Summary statistics (and graphics) need to take this fact into account!
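A minimal sketch of what "taking this into account" could look like: summarize on a log scale (median, geometric mean, orders of magnitude spanned) instead of relying on the arithmetic mean. The function name and the sample values are hypothetical, for illustration only; they are not the survey data.

```python
import math
import statistics

def log_scale_summary(values):
    """Summarize heavy-tailed values on a log10 scale (hypothetical helper)."""
    # log10 is undefined at zero or below, so count those points separately
    positive = [v for v in values if v > 0]
    logs = [math.log10(v) for v in positive]
    return {
        "n": len(positive),
        "n_zero_or_negative": len(values) - len(positive),
        "median": statistics.median(positive),
        # geometric mean = 10 ** (mean of the log10 values)
        "geometric_mean": 10 ** statistics.fmean(logs),
        "arithmetic_mean": statistics.fmean(positive),
        "orders_of_magnitude": max(logs) - min(logs),
    }

# Made-up volumes spanning five orders of magnitude (not survey data)
volumes = [1, 10, 100, 1000, 100000]
summary = log_scale_summary(volumes)
```

With data like this, the arithmetic mean is dominated by the largest value, while the median and geometric mean stay near the middle of the range.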
But there's idiosyncratic data structure…
and then there are data problems.
We do have some of the latter.
Notable examples include:
But also:
These throw off some important summaries!
Notably, volume by DBMS – which seems to have important implications – is currently strongly influenced by a single agency with improbably large estimates across the board.
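One way to gauge how much a single agency drives the volume-by-DBMS summary is to recompute the totals with that agency left out and to check the largest single-agency share per DBMS. This is only a sketch: the record layout `(agency, dbms, volume)`, the function names, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict

def volume_by_dbms(records, exclude_agency=None):
    """Total volume per DBMS, optionally leaving one agency out."""
    totals = defaultdict(float)
    for agency, dbms, volume in records:
        if agency != exclude_agency:
            totals[dbms] += volume
    return dict(totals)

def max_agency_share(records):
    """For each DBMS, the agency with the largest share of its total volume."""
    by_pair = defaultdict(float)
    for agency, dbms, volume in records:
        by_pair[(dbms, agency)] += volume
    totals = volume_by_dbms(records)
    shares = {}
    for (dbms, agency), vol in by_pair.items():
        frac = vol / totals[dbms] if totals[dbms] else 0.0
        if frac > shares.get(dbms, ("", 0.0))[1]:
            shares[dbms] = (agency, frac)
    return shares

# Made-up records (not survey data): agency "C" dominates DBMS "X"
records = [
    ("A", "X", 10.0), ("B", "X", 20.0), ("C", "X", 970.0),
    ("A", "Y", 5.0), ("B", "Y", 5.0),
]
with_all = volume_by_dbms(records)
without_c = volume_by_dbms(records, exclude_agency="C")
shares = max_agency_share(records)
```

If a DBMS total changes drastically when one agency is excluded, that summary is better reported both with and without the agency.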
See earlier cautions regarding data. But beyond that…
Do we expect all unstructured data to grow homogeneously?
I think this would be much more interesting broken down by DBMS, as in the volume-by-DBMS summary…
But there are about three outliers distorting things, and they need thought.
Note the many zero-value data points along the axes.
Note also that the correlation is much stronger by agency – we might tentatively theorize about this.
Will look like the above.
There are some challenges in appropriately representing costs.
Questions?