Kirby Arinder
09/18/2017
Data cleaning is a funny topic for a panel.
Steering a useful middle course is difficult.
So I won't!
I'm going to aim for both extremes, instead:
Why?
Data cleaning: Checking for and imposing
This creates clean data!
But you might be saying, “Stop!…”
In this context, semantic virtue is just truth. It's obvious why we need this!
But I'm not going to talk about it, for a few reasons:
In order to be useful, information doesn't just need to be true; it needs to obey certain constraints of form.
For instance, it should be:
If there is no answer to a question, is it because:
These conditions often need to be represented differently, because they are analyzed differently!
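To make that concrete, here is a minimal Python sketch of keeping different kinds of "no answer" distinct instead of collapsing them into one blank. The three reasons and the raw codes ("-7", "-8", empty string) are hypothetical, invented purely for illustration; a real dataset will define its own.

```python
from collections import Counter
from enum import Enum


class Missing(Enum):
    # Hypothetical reasons for a missing answer (illustration only).
    NOT_ASKED = "not_asked"
    REFUSED = "refused"
    NOT_RECORDED = "not_recorded"


# Hypothetical raw codes mapped to the reason they stand for.
RAW_MISSING_CODES = {
    "-7": Missing.NOT_ASKED,
    "-8": Missing.REFUSED,
    "": Missing.NOT_RECORDED,
}


def parse_answer(raw):
    """Return the answer as an int, or a Missing member saying why there is none."""
    text = raw.strip()
    if text in RAW_MISSING_CODES:
        return RAW_MISSING_CODES[text]
    return int(text)


def summarize(raw_answers):
    """Split parsed answers into real values and a count of each missing reason."""
    parsed = [parse_answer(r) for r in raw_answers]
    values = [p for p in parsed if not isinstance(p, Missing)]
    reasons = Counter(p for p in parsed if isinstance(p, Missing))
    return values, reasons


values, reasons = summarize(["3", "-7", "", "5", "-8", "-7"])
```

Because each reason keeps its own label, an analysis can drop "not asked" rows while still counting refusals, which a single blank would not allow.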
These are far from the only formal constraints on data, but you get the idea!
Data needs to be syntactically clean in order to be usefully read, analyzed and presented.
Now let's step back to the raw data.
Raw data is the data as you receive it: you haven't manipulated, summarized, edited, or processed it in any way!
Obviously, it often violates the above formal constraints.
So it needs to be transformed. But it must be transformed reproducibly.
Non-reproducible transformation cannot be distinguished from invention!
Transformation is not replacement!
You must preserve your raw data in their original form.
In other words…
It's a good idea to push as much of your data transformation onto computer code as you can.
This is for several reasons!
But the most important reason, to my mind, is:
You need a recipe – workpapers that tell you all and only the fully defined operations necessary to produce your clean data from your raw data.
Working code is this recipe.
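As a rough illustration of code-as-recipe, here is a small Python sketch, not a prescribed workflow. The file names, column names (id, state, age), and cleaning steps are all hypothetical. The points it tries to show are that the raw file is only ever read, the clean file is written somewhere else, and running the recipe twice gives the same result.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["id", "state", "age"]  # hypothetical columns


def clean_row(row):
    """Fully defined steps: trim whitespace, upper-case the state, make age an int."""
    return {
        "id": row["id"].strip(),
        "state": row["state"].strip().upper(),
        "age": int(row["age"]),
    }


def run_recipe(raw_path, clean_path):
    """Read the raw file, write a separate clean file, and return the clean rows."""
    raw_path, clean_path = Path(raw_path), Path(clean_path)
    if raw_path.resolve() == clean_path.resolve():
        raise ValueError("refusing to overwrite the raw data")
    with raw_path.open(newline="") as src:
        cleaned = [clean_row(r) for r in csv.DictReader(src)]
    with clean_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(cleaned)
    return cleaned


# Demonstration on a throwaway directory with made-up data.
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw.csv"
raw_text = "id,state,age\n 001 , ms ,42\n002,AL, 37 \n"
raw_file.write_text(raw_text)

first = run_recipe(raw_file, workdir / "clean.csv")
second = run_recipe(raw_file, workdir / "clean.csv")
```

Anyone holding the raw file and this script can regenerate the clean file, which is the sense in which the code is the workpaper.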
Cleaned data ought to have:
These explicit parameters should be documented in a separate file: the data dictionary or codebook.
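One way to put a data dictionary to work is to make it machine-readable and check the clean data against it. The Python sketch below assumes a hypothetical dictionary with three variables and a handful of parameters (type, allowed values, range). In practice the dictionary would live in its own file; it is inlined here only to keep the example self-contained.

```python
# Hypothetical data dictionary: one entry per variable in the clean data.
DATA_DICTIONARY = {
    "id": {"type": str, "description": "Respondent identifier"},
    "state": {"type": str, "allowed": {"AL", "MS"}, "description": "State code"},
    "age": {"type": int, "min": 0, "max": 120, "description": "Age in years"},
}


def check_record(record, dictionary):
    """Return a list of ways the record departs from the data dictionary."""
    problems = []
    for name in dictionary:
        if name not in record:
            problems.append(f"{name}: missing variable")
    for name, value in record.items():
        spec = dictionary.get(name)
        if spec is None:
            problems.append(f"{name}: not in data dictionary")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: {value} is below minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: {value} is above maximum {spec['max']}")
    return problems


good = check_record({"id": "001", "state": "MS", "age": 42}, DATA_DICTIONARY)
bad = check_record({"id": "002", "state": "ZZ", "age": 130, "notes": "x"}, DATA_DICTIONARY)
```

Here `good` comes back empty, while `bad` lists the unknown state code, the out-of-range age, and the undocumented `notes` variable.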
Well, it's a long story. But basically, this structure guarantees:
Well, yeah. But repeat this mantra to yourself:
Data storage, analysis, and display are distinct functions!
So let's review.
Data cleaning changes raw data to clean data by means of an explicit, operationalized transformation recipe, and it documents the parameters of the clean data in a codebook or data dictionary.
So here's one example and two tricks.
Recursively, it involves checking someone else's data dictionary, which in this case was contained in three PDF documents totaling over 1,200 pages.
Obviously, that's somewhat impractical for many purposes!
Regular expressions (regexes) are patterns, written in a compact notation, that specify sets of character strings.
They can pick out patterns with surprising flexibility! For example, in one notational system:
More interestingly,
Which isn't intrinsically interesting, I grant you….
But it's a real example of the use of these things!
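For a flavor of what this looks like in code, here is a minimal Python sketch using the standard `re` module. The line layout it matches (variable name, type, length, label) is a hypothetical stand-in, not the format of the documents described above; the idea is only that one pattern can pull structured entries out of text extracted from a long PDF while skipping the page furniture around them.

```python
import re

# Hypothetical codebook line: NAME  Num|Char  LENGTH  free-text label
ENTRY = re.compile(
    r"^(?P<name>[A-Z][A-Z0-9_]{1,31})\s+"  # variable name in capitals
    r"(?P<type>Num|Char)\s+"               # storage type
    r"(?P<length>\d+)\s+"                  # field length
    r"(?P<label>.+?)\s*$",                 # label, trailing spaces dropped
    re.MULTILINE,
)

# Made-up text standing in for one extracted page.
sample = """\
Codebook, page 17
AGE_YRS   Num   3   Age in years at interview
ST_CODE   Char  2   State postal abbreviation
(continued on next page)
"""

# One dict per matching line; header and footer lines do not match.
entries = [m.groupdict() for m in ENTRY.finditer(sample)]
names = [e["name"] for e in entries]
```

Each matched line becomes a record with named fields, so the extracted entries can go straight into a table for checking.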
Regexes can be used:
Using these techniques, you can
(We went one step further and put the result into a webapp – again, presentation is distinct from analysis and storage.)
Questions?