MDMWG report progress

MS Data Management Working Group
6/20/18

Where are we so far?

It's not a bad place.

  • The survey is finished
  • We have (some) data
  • It's time to think about the future!

So:

What will our final product look like?

Conceptually, the final report has two main parts:

  • A discussion of best practices
  • Presentation of our findings

But what about the executive summary, you ask?

Well, good point.

There are other parts to the report.

  • Dr. Rials covered these quite well in outline at the last meeting!
  • Most of them support or summarize the two I mentioned.

So I'd like to cover just the two I mentioned in a bit more depth!

Best practices

The questions we've asked mostly count things or place them in categories.

We've also chosen a fairly high level of analysis.

  • Mostly, it's the agency as a whole, though there are exceptions.

We chose this level so that we could actually get results.

  • The (impractical) perfect is the enemy of the (practical) good!

But...

This limits the formative use of our analysis!

Most causal and even relational analyses are ruled out (or very complicated).

Data influences analysis (duh)

Counts and categories are useful to get the lay of the land.

But they don't explain or show a way forward!

That's not good or bad – it's just the nature of the data.
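
To make "counts and categories" concrete, here is a minimal sketch of the kind of tally this sort of data supports. The answers and the DBMS labels are made up for illustration; they are not survey results.

```python
from collections import Counter

# Hypothetical answers: one primary DBMS per agency (made-up values).
responses = ["DBMS A", "DBMS B", "DBMS A", "DBMS C", "DBMS B", "DBMS A"]

def tally(answers):
    """Return (category, count, share) rows, most common first."""
    counts = Counter(answers)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]

for name, n, share in tally(responses):
    # One line per category: label, count, share of all answers.
    print(f"{name:<10}{n:>3}{share:>7.0%}")
```

A table like this shows the lay of the land, but nothing in it says why one category is more common than another.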

An aside...

Some of the more complicated analyses might be useful in-house.

But I'm not sure we want to present high-dimensional regressions here!

A model like cost ~ software mix + agency + agency size + data volume might be pretty interesting, though…
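
For a sense of why a model like that gets high-dimensional, here is a rough sketch of how its terms would expand into design-matrix columns. It assumes one row per agency and simple 0/1 (treatment) coding for the categorical terms; the records, field names, and helper functions are all hypothetical, not our survey data or anyone's actual method.

```python
# Hypothetical records, one per agency; every value is made up for illustration.
records = [
    {"agency": "A", "software_mix": "commercial", "agency_size": 120, "data_volume_tb": 4.0, "cost": 310.0},
    {"agency": "B", "software_mix": "open source", "agency_size": 45, "data_volume_tb": 1.5, "cost": 90.0},
    {"agency": "C", "software_mix": "commercial", "agency_size": 300, "data_volume_tb": 12.0, "cost": 820.0},
    {"agency": "D", "software_mix": "open source", "agency_size": 80, "data_volume_tb": 2.5, "cost": 150.0},
]

def design_columns(rows, categorical, numeric):
    """List the columns a treatment-coded design matrix would have."""
    columns = ["(Intercept)"]
    for var in categorical:
        levels = sorted({row[var] for row in rows})
        # The first level is the baseline; every other level gets a 0/1 column.
        columns += [f"{var}={level}" for level in levels[1:]]
    columns += list(numeric)
    return columns

def design_row(row, columns):
    """Encode one record as numbers, in the same order as `columns`."""
    values = []
    for col in columns:
        if col == "(Intercept)":
            values.append(1.0)
        elif "=" in col:
            var, level = col.split("=", 1)
            values.append(1.0 if row[var] == level else 0.0)
        else:
            values.append(float(row[col]))
    return values

columns = design_columns(
    records,
    categorical=["software_mix", "agency"],
    numeric=["agency_size", "data_volume_tb"],
)
matrix = [design_row(row, columns) for row in records]
response = [row["cost"] for row in records]  # the left-hand side of the formula

print(len(matrix), "rows,", len(columns), "columns")
```

In this made-up setup the agency term alone needs almost as many columns as there are rows, so the model has more columns than observations before we estimate anything.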

Get back to the report already!

Okay, okay.

This means the best practices section will carry a lot of the weight.

And that means we should devote some thought to how it should look!

In particular...

While we can mine the literature, we need focus.

Best practices are always best for some purpose.

What do we want to emphasize, in what proportion?

  • Security?
  • Cost?
  • Speed?

A suggestion, which I'll leave on the table

The point of data is to let us know stuff. And know that we know!

Fundamentally, collections of data must:

  • enable valid inferences regarding their domain
  • even inferences not originally envisioned by the designer
  • in such a way that those inferences are justifiable.

If you don't have this, all other considerations fall by the wayside.

But what about the data themselves?

As I said, our presentation of the data is going to be largely descriptive.

And we hope to tell a lot of our story graphically.

Some of our graphic choices will arise from the data.

For instance, let's consider the questions about data volume and growth by DBMS.

We might see several things…

Possibilities...

[Plot omitted: one possible pattern of data volume and growth by DBMS]

Possibilities...

[Plot omitted: a second possible pattern of data volume and growth by DBMS]

Possibilities...

[Plot omitted: a third possible pattern of data volume and growth by DBMS]

But some of our choices will be mostly aesthetic.

[Plot omitted]

This creates a certain impression...

[Plot omitted]

Whereas a different plot of the same data...

[Plot omitted]

Creates a different impression.

[Plot omitted]
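
The original plots aren't reproduced here, so as a stand-in, here is a small sketch of one common way the same numbers can leave different impressions: changing where the axis starts. The volumes and labels are made up, and `text_bars` is a hypothetical helper, not the code behind the slides.

```python
# Made-up data volumes (TB) for three hypothetical systems.
volumes = {"DBMS A": 96, "DBMS B": 100, "DBMS C": 104}

def text_bars(data, baseline, width=40):
    """Draw one text bar per item, scaled from `baseline` up to the largest value."""
    top = max(data.values())
    lines = []
    for name, value in data.items():
        # Bar length is the value's position between the baseline and the maximum.
        length = round(width * (value - baseline) / (top - baseline))
        lines.append(f"{name} {'#' * length} {value}")
    return lines

print("Axis starts at 0:")
print("\n".join(text_bars(volumes, baseline=0)))
print()
print("Axis starts at 90:")
print("\n".join(text_bars(volumes, baseline=90)))
```

With the axis at zero the three bars look nearly equal; with the axis starting at 90, the same values look dramatically different.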

That's all, folks!

Questions?