Survey results: early analysis and graphics

MDMWG
7/18/2018

How early exactly?

We're about 60% (58.75%) through with:

cleaning
- normalization
- anomaly checks
preliminary analysis
- summary statistics
- elementary hypothesis formation
preliminary graphics
- mostly exploratory
- some attempts at user-friendliness/graphic professionalism

Some cautions

Many quantities span many orders of magnitude and follow a power-law distribution.

For unstructured data

volume
- top decile has 90%
- top individual has 51%
growth
- top decile has 96%
- top individual has 74%

(This is after removing an outlier… about which more later.)
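The concentration figures above can be reproduced with a short calculation. This is a minimal sketch only: the function name `top_shares` and the sample numbers are made up for illustration and are not survey data, and it assumes "top decile" means the largest tenth of respondents, rounded up.

```python
import math


def top_shares(values):
    """Return (top-decile share, top-individual share) of the total."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("total is zero; shares are undefined")
    # Size of the top decile, rounded up so it holds at least one respondent.
    k = max(1, math.ceil(len(ordered) / 10))
    return sum(ordered[:k]) / total, ordered[0] / total


# Illustrative values only (hypothetical, not from the survey).
example_volumes = [600, 250, 50, 30, 20, 10, 10, 10, 5, 5,
                   2, 2, 2, 2, 1, 1, 0, 0, 0, 0]
decile_share, top_share = top_shares(example_volumes)
print(f"top decile: {decile_share:.0%}, top individual: {top_share:.0%}")
```

Running the same function before and after dropping a suspected outlier shows how much one respondent moves both shares.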

Some cautions

For structured data

volume
- top decile has 93%
- top individual has 40%
growth
- top decile has 96%
- top individual has 63%

Summary statistics (and graphics) need to take this fact into account!
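One way to take the skew into account is to report the median and geometric mean next to the arithmetic mean. The sketch below is an assumption about how that might be done, not the working group's actual method; `skew_aware_summary` is a hypothetical helper, and non-positive values are left out of the geometric mean because their logarithm is undefined.

```python
import math
import statistics


def skew_aware_summary(values):
    """Summaries that stay readable when values span orders of magnitude."""
    positive = [v for v in values if v > 0]
    geometric_mean = None
    if positive:
        # Geometric mean = exp of the mean of the logs (positive values only).
        geometric_mean = math.exp(sum(math.log(v) for v in positive) / len(positive))
    return {
        "n": len(values),
        "n_nonpositive": len(values) - len(positive),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "geometric_mean": geometric_mean,
    }
```

On power-law data the mean is driven by the largest few values, while the median and geometric mean describe a more typical respondent.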

Data problems

But there's idiosyncratic data structure…

and then there are data problems.

We do have some of the latter.

Data problems

Notable examples include:

Sizable deviations from Benford's Law
Varying degrees of precision (1 significant digit vs. 7)
Outliers
- More data than humanity!
- Zero volume/growth
- Four-order-of-magnitude differences between volume and growth
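For reference, Benford's Law expects leading digit d to appear with frequency log10(1 + 1/d). The following is a minimal sketch of a first-digit comparison; the function names are hypothetical, and zero or negative values are skipped since they have no meaningful leading digit here.

```python
import math
from collections import Counter


def leading_digit(x):
    """First nonzero digit of a positive number, read from its text form."""
    for ch in str(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no leading digit")


def benford_comparison(values):
    """Map each digit 1-9 to (observed frequency, Benford expected frequency)."""
    digits = [leading_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive values to compare")
    counts = Counter(digits)
    n = len(digits)
    return {
        d: (counts[d] / n, math.log10(1 + 1 / d))
        for d in range(1, 10)
    }
```

With a small number of respondents some deviation is expected by chance alone, so large gaps are a prompt for a closer look rather than proof of a problem.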

Data problems

But also:

Incomplete surveys
Internally incoherent answers
- count vs. breakdown
- differing interpretations of the update schedule
NA values and zeroes apparently used interchangeably
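A count-versus-breakdown check can be sketched as below. The function name and the use of `None` for NA are assumptions for illustration; the point is that a missing answer is reported separately from a zero rather than silently treated as one.

```python
def check_breakdown(total, parts):
    """Classify one response's reported total against its itemized parts.

    `None` stands for NA and is kept distinct from zero.
    """
    if total is None or any(p is None for p in parts):
        return "missing"
    if sum(parts) == total:
        return "consistent"
    return "mismatch"


# Hypothetical responses: (reported count, itemized breakdown).
responses = [(10, [4, 6]), (10, [4, 5]), (None, [1, 2]), (0, [0, 0])]
for total, parts in responses:
    print(total, parts, check_breakdown(total, parts))
```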

Data problems

These throw off some important summaries!

Notably, volume by DBMS – which seems to have important implications – is currently strongly influenced by a single agency with improbably large estimates across the board.
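A simple sensitivity check is to recompute each DBMS's share of volume with one agency left out. The sketch below assumes records shaped as (agency, DBMS, volume) tuples; that layout, the function name, and the labels in the example are all hypothetical.

```python
from collections import defaultdict


def volume_share_by_dbms(records, exclude_agency=None):
    """Share of total volume per DBMS, optionally leaving one agency out."""
    totals = defaultdict(float)
    for agency, dbms, volume in records:
        if agency == exclude_agency:
            continue
        totals[dbms] += volume
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {dbms: volume / grand_total for dbms, volume in totals.items()}


# Illustrative records only (hypothetical agencies and DBMS labels).
records = [("A", "X", 900.0), ("B", "X", 10.0), ("B", "Y", 40.0), ("C", "Y", 50.0)]
print(volume_share_by_dbms(records))
print(volume_share_by_dbms(records, exclude_agency="A"))
```

If the ranking of DBMSs changes when a single agency is dropped, the headline summary is resting on that agency's estimates.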

All that said...

plot of chunk unnamed-chunk-1

Notes

See earlier cautions regarding data. But beyond that…

Do we expect all unstructured data to grow homogeneously?

This one would ordinarily be vertical, but we're on a screen.

plot of chunk unnamed-chunk-2

Local control...

plot of chunk unnamed-chunk-3

Local control continued

I think this would be much more interesting broken down by volume-by-DBMS…

But there are about three outliers distorting things, and they need thought.

Structured data

plot of chunk unnamed-chunk-4

Structured data

Note the many zero-value data points along the axes.

Note also the correlation is much stronger by agency – we might tentatively theorize about this.
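Because zeros cannot be placed on a log scale, one option is to compute the correlation on log-transformed values and report how many zero-containing pairs were set aside. This is a sketch under that assumption, not the analysis behind the plot; `log_log_correlation` is a hypothetical name.

```python
import math


def log_log_correlation(pairs):
    """Pearson correlation of log10(x) and log10(y), plus the dropped-pair count.

    Pairs with a zero or negative value are dropped and counted, not hidden.
    """
    kept = [(math.log10(x), math.log10(y)) for x, y in pairs if x > 0 and y > 0]
    dropped = len(pairs) - len(kept)
    n = len(kept)
    if n < 2:
        return None, dropped
    mean_x = sum(x for x, _ in kept) / n
    mean_y = sum(y for _, y in kept) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in kept)
    sxx = sum((x - mean_x) ** 2 for x, _ in kept)
    syy = sum((y - mean_y) ** 2 for _, y in kept)
    if sxx == 0 or syy == 0:
        return None, dropped  # no variation, correlation undefined
    return sxy / math.sqrt(sxx * syy), dropped
```

The dropped count matters here: if many points sit on the axes, the correlation describes only the respondents who reported both quantities.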

Backups

plot of chunk unnamed-chunk-5

And basically every other count...

Will look like the above.

A note on costs

There are some challenges in appropriately representing costs:

Ours have optional nesting
There are outliers and improbable values

That's all, folks!

Questions?