Data cleaning: theory and practice

Kirby Arinder
09/18/2017

Basics

Data cleaning is a funny topic for a panel.

  • The theory isn't controversial.
  • The practice doesn't lend itself to a nontrivial summary!

Steering a useful middle course is difficult.

Basics

So I won't!

Come again?

I'm going to aim for both extremes, instead:

  • A little theory (with axiology)
  • A little practice (with an example)

Why?

  • The first is aimed at high-level clarity regarding motivations and practices;
  • the second at providing a starting point for specific activities.

So let's start with some definitions.

Data cleaning: Checking for and imposing

  • either
    • semantic virtue (truth) or
    • syntactic virtue (validity)
  • within and upon raw data
  • by means of a documented, repeatable process.

This creates clean data!

But you might be saying, “Stop!…”

"... please explain further, good sir."

Semantic virtue?

In this context, semantic virtue is just truth. It's obvious why we need this!

But I'm not going to talk about it, for a few reasons:

  • It resists theoretical reduction.
  • The process of finding it isn't particularly distinct from the rest of the job.
  • Arguably, it's not even our primary concern (though this is a long digression).

Okay... syntactic virtue, then.

In order to be useful, information doesn't just need to be true; it needs to obey certain constraints of form.

For instance, it should be:

  • readable and useful
    • A blunt needle in hand is more useful than a sharp one in a haystack!
  • internally consistent and uniform
    • Mixing Fahrenheit and Celsius without warning leads to confusion!
  • composed of appropriate data types and ranges
    • “True” isn't a height; “December 48th” isn't a date.
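
To make that concrete, here's a minimal sketch in R of checking types and ranges. The data frame and column names are invented for this illustration; they aren't from any real dataset.

  # Hypothetical data with type and range violations
  df <- data.frame(
    height_cm  = c(172, 165, "True", 890),
    visit_date = c("2017-09-18", "2017-12-48", "2017-01-05", "2017-03-01"),
    stringsAsFactors = FALSE
  )

  height_num <- suppressWarnings(as.numeric(df$height_cm))   # "True" becomes NA
  bad_height <- is.na(height_num) | height_num < 50 | height_num > 250
  bad_date   <- is.na(as.Date(df$visit_date, format = "%Y-%m-%d"))  # "December 48th"-style dates fail

  df[bad_height | bad_date, ]   # rows that violate a type, range, or date constraint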

Even missing data should obey formal constraints!

If there is no answer to a question, is it because:

  • We don't know the answer? (What is Genghis Khan's birthday?)
  • We know the answer, and it's zero? (How many times have people walked on Mars?)
  • There is no answer yet? (What is my date of death?)
  • There is no answer at all? (What is the name of the King of France in 2017?)

These conditions often need to be represented differently, because they are analyzed differently!
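
One way to keep these cases distinct, sketched in R (the column names and status codes are invented for illustration, not a standard):

  # Four kinds of "no answer", kept distinct instead of one undifferentiated NA
  answers <- data.frame(
    question = c("Genghis Khan's birthday", "Walks on Mars",
                 "My date of death", "King of France, 2017"),
    value    = c(NA, 0, NA, NA),
    status   = c("unknown", "known_zero", "not_yet_determined", "not_applicable"),
    stringsAsFactors = FALSE
  )
  # Analysis can then filter on `status` rather than guessing what each NA means.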

Syntactic virtue continued

These are far from the only formal constraints on data, but you get the idea!

Data needs to be syntactically clean in order to be usefully read, analyzed and presented.

Now let's step back to the raw data.

Raw data?

The data as you receive them – you haven't manipulated, summarized, edited, or processed them in any way!

Obviously, it often violates the above formal constraints.

So it needs to be transformed. But it must be transformed reproducibly.

Non-reproducible transformation cannot be distinguished from invention!

But take note...

Transformation is not replacement!

You must preserve your raw data in their original form.

In other words…

Your raw data are important...


But don't open them up and mess around.


Data transformation

It's a good idea to push as much of your data transformation onto computer code as you can.

This is for several reasons!

  • Avoids idiosyncratic error
  • Saves time on large datasets
    • Your procedures need to optimize staff time!
  • Can be repeated in similar future cases
    • “If you have to do it twice, you'll have to do it many times, so automate it.”
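
Here's a minimal sketch of what a scripted transformation can look like in R; the file names, columns, and cleaning steps are placeholders, not a prescription:

  # clean_data.R -- the documented, repeatable recipe (illustrative names)
  raw <- read.csv("raw/survey_raw.csv", stringsAsFactors = FALSE)   # raw file is read, never edited

  clean <- raw
  clean$temperature_c <- ifelse(clean$unit == "F",
                                (clean$temperature - 32) * 5 / 9,   # unify mixed units
                                clean$temperature)
  clean$unit <- NULL

  write.csv(clean, "clean/survey_clean.csv", row.names = FALSE)     # clean copy lives elsewhere

Run the script again on the same raw file and you get the same clean file. That reproducibility is the whole point!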

Data transformation

But the most important reason, to my mind, is:

  • Working computer code is, by definition, fully operationalized.

You need a recipe – workpapers that tell you all and only the fully defined operations necessary to produce your clean data from your raw data.

Working code is this recipe.

So what do you get at the end?

Cleaned data ought to have:

  • Each variable in its own column
  • Each observation in its own row
  • Each type of observational unit in its own table
  • Each observation of a given variable within explicit parameters for that variable

These explicit parameters should be documented in a separate file: the data dictionary or code book.
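
For example (with entirely made-up values), here are the same measurements in a messy layout versus the clean layout just described:

  # Messy: one variable (temperature) smeared across several columns
  messy <- data.frame(site = c("A", "B"),
                      temp_2016 = c(14.1, 15.3),
                      temp_2017 = c(14.8, 15.9))

  # Clean: each variable in its own column, each observation in its own row
  clean <- data.frame(site   = c("A", "A", "B", "B"),
                      year   = c(2016, 2017, 2016, 2017),
                      temp_c = c(14.1, 14.8, 15.3, 15.9))

  # The codebook (a separate file) would record, for example:
  #   temp_c: mean annual temperature, degrees Celsius, plausible range -50 to 50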

Why?

Well, it's a long story. But basically, this structure guarantees:

  • Easier computation
  • More efficient storage
  • Greater resistance to data corruption by editing

But it's just a block of cells now!

Well, yeah. But repeat this mantra to yourself:

Data storage, analysis, and display are distinct functions!

So let's review.

Or...


The summary.

Data cleaning changes raw data into clean data by means of an explicit, operationalized transformation recipe, and it documents the parameters of the clean data in a codebook or data dictionary.

The Case of the 1200-Page PDF

So here's one example and two tricks.

Recursively, it involves checking someone else's data dictionary – which, in this case, was contained in three documents, in PDF format, totaling over 1200 pages.

Obviously, that's somewhat impractical for many purposes!

Trick one: OCR/PDF extraction

Here's a link about OCR and PDF extraction.

Here's one that's just about OCR.

And here's one that's just PDFs.
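
For what it's worth, here's one possible extraction step in R, using the pdftools and tesseract packages; the file names are placeholders, and plenty of other tools would do the job:

  # install.packages(c("pdftools", "tesseract"))
  library(pdftools)

  pages <- pdf_text("data_dictionary_part1.pdf")   # one character string per page
  length(pages)                                    # hundreds of pages of raw text

  # If the PDF is just scanned images, OCR the pages instead:
  # library(tesseract)
  # text <- tesseract::ocr("scanned_page.png")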

That's going to give you a long sequence of text – not a table! So what then?

Trick two: Regular expressions

Regular expressions are a pattern-matching notation for specifying sets of character strings.

They can pick out patterns with surprising flexibility! For example, in one notational system:

  • [a-z] picks out any lower-case letter – a, b, c, but not A or 3 or !
  • [hrbB]at picks out hat, rat, Bat, and bat, but not Hat or cat
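
To see those patterns in action, here's a quick check in R (grepl() reports whether each string contains a match):

  grepl("[a-z]", c("a", "B", "3", "!"))
  # [1]  TRUE FALSE FALSE FALSE
  grepl("[hrbB]at", c("hat", "rat", "Bat", "bat", "Hat", "cat"))
  # [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE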

Regular expressions continued

More interestingly,

  • [A-Z][a-z].+[0-9]{1,2}/[0-9]{1,2}/[0-9]{4} picks out any upper-case letter, followed by a lower-case letter, followed by one or more of any character (except a newline), followed by one or two digits, a slash, one or two digits, a slash, and four digits.

Which isn't intrinsically interesting, I grant you….

But it's a real example of the use of these things!
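
Here's an illustrative (made-up) use of that pattern in R, pulling the matching spans out of free text:

  x <- c("Smith, John 12/01/1980", "no date on this line", "Doe, Jane 3/15/1975")
  pattern <- "[A-Z][a-z].+[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"
  regmatches(x, regexpr(pattern, x))
  # [1] "Smith, John 12/01/1980" "Doe, Jane 3/15/1975"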

Regular expressions continued

Regexes can be used:

  • in the process of organizing a disorganized vector of text into tabular data
  • within tabular data, to check adherence to syntactic (formal) constraints
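
A small sketch of that second use, with invented values: flag every entry in a column that fails the expected format.

  dates <- c("12/01/1980", "3/15/1975", "December 48th", "1980-12-01")
  ok <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", dates)
  dates[!ok]   # the entries that need fixing (or rejecting)
  # [1] "December 48th" "1980-12-01"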

Putting it all together

Using these techniques, you can

  • take data locked away behind a picture or PDF
  • extract them to text
  • organize them into tabular data by regular features
  • and clean them with regular expressions and elbow grease!

(We went one step further and put the result into a webapp – again, presentation is distinct from analysis and storage.)

Final notes

The URL for this presentation:

rpubs.com/CKirbyArinder/NLPES2017DC

Ex. 1: Recursivity


That's all, folks!

Questions?