Transforming messy data, creating tidy data

PEER staff
09/04/2024

Everybody likes clean data.

And nobody likes messy data!

Sometimes it seems like most of the data we get is the latter!

Of course, “clean” and “messy” can be just words of approval or disapproval…

But calling data "messy" is often shorthand for real issues!

Maybe the data:

  • include a lot of falsehoods?
  • are difficult to interpret or analyze?
  • are not clearly documented?

All reasonable complaints!

But it would be nice to have some conceptual clarity (and practical guidance) around the issue.

So let's try that!

Here's what we're going to talk about today:

  • The theory of data cleaning, which enables
    • transforming received data and
    • producing internal data;
  • An imaginary example to illustrate the point; and
  • A real example: What is clean data for GIS purposes?

So let's start with some definitions.

Data cleaning: Checking for and imposing

  • either
    • semantic virtue (truth) or
    • syntactic virtue (formal validity)
  • within and upon raw data
  • by means of a documented, repeatable process.

This creates clean data!

Semantic virtue?

In this context, semantic virtue is just truth. It's obvious why we need this!

But I'm not going to talk about it, for a few reasons:

  • It resists theoretical reduction.
  • The process of finding it isn't particularly distinct from the rest of the job.
  • Arguably, it's not even our primary concern (though this is a long digression).

Okay... syntactic virtue, then.

In order to be useful, information doesn't just need to be true; it needs to obey certain constraints of form.

For instance, it should be:

  • readable and useful
    • A blunt needle in hand is more useful than a sharp one in a haystack! (Note that utility is relative to the analysis at hand)
  • internally consistent and uniform
    • Mixing Fahrenheit and Celsius without warning leads to confusion!
  • composed of appropriate data types and ranges
    • “True” isn't a height; “December 48th” isn't a date.

Even missing data should obey formal constraints!

If there is no answer to a question, is it because:

  • We don't know the answer? (What is Genghis Khan's birthday?)
  • The answer violates expectations of the question? (List the people who have walked on Mars.)
  • There is no answer yet? (What is my date of death?)
  • There is no answer at all? (What is the name of the King of France in 2024?)

These conditions often need to be represented differently, because they are analyzed differently!
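
For instance, here is a minimal sketch (in Python with pandas; the column names and category labels are invented for illustration) of one way to keep these conditions distinct: store the value in one column and the reason it is missing in another.

  import pandas as pd

  # Invented example: an explicit "why_missing" column keeps the four
  # conditions above distinct instead of lumping them all into one blank.
  answers = pd.DataFrame({
      "question": [
          "Genghis Khan's birthday",
          "People who have walked on Mars",
          "My date of death",
          "Name of the King of France in 2024",
      ],
      "answer": [pd.NA, pd.NA, pd.NA, pd.NA],
      "why_missing": ["unknown", "none_meet_criteria", "not_yet_determined", "no_such_entity"],
  })

  # An analysis can then treat each condition differently, e.g. keep only
  # the rows where an answer could exist but is not yet known.
  print(answers[answers["why_missing"].isin(["unknown", "not_yet_determined"])])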

Syntactic virtue, continued

These are far from the only formal constraints on data, but you get the idea!

Data needs to be syntactically virtuous in order to be usefully read, analyzed and presented.
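
As a sketch of what checking constraints like these can look like in code (Python with pandas here; the specific checks and values are just the examples above, not a prescribed procedure):

  import pandas as pd

  heights = pd.Series(["162", "175", "True"])
  dates = pd.Series(["July 21, 2022", "December 48, 2022"])

  # Coercion turns anything that is not a valid number or date into a
  # missing value, which makes the offending entries easy to flag.
  bad_heights = heights[pd.to_numeric(heights, errors="coerce").isna()]
  bad_dates = dates[pd.to_datetime(dates, errors="coerce").isna()]

  print(bad_heights)  # "True" is not a height
  print(bad_dates)    # "December 48" is not a date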

Now let's step back to the raw data.

Raw data?

The data as you receive it – you haven't manipulated, summarized, edited, or processed it in any way!

Obviously, it often violates the above formal constraints.

So it needs to be transformed. But it must be transformed reproducibly.

Non-reproducible transformation cannot be distinguished from invention!

But take note...

Transformation is not replacement!

You must preserve your raw data in their original form.

In other words…

Your raw data are important...

But don't open them up and mess around.

Data transformation

It's a good idea to push as much of your data transformation onto computer code as you can.

This is for several reasons!

  • Avoids idiosyncratic error
  • Saves time on large datasets
    • Your procedures need to optimize staff time!
  • Can be repeated in similar future cases
    • “If you have to do it twice, you'll have to do it many times, so automate it.”

Data transformation

But the most important reason, to my mind, is:

  • Functional computer code is by definition fully operationalized.

You need a recipe – workpapers that tell you all and only the fully defined operations necessary to produce your clean data from your raw data.

Working code is this recipe.
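
As a minimal sketch (the file paths and the single transformation step are placeholders, not PEER's actual procedure), such a recipe reads the raw file, never writes back to it, and produces a separate clean file:

  import pandas as pd

  RAW_FILE = "raw/received_data.csv"      # hypothetical path; read, never edited
  CLEAN_FILE = "clean/cleaned_data.csv"   # hypothetical output path

  def clean(raw: pd.DataFrame) -> pd.DataFrame:
      """Every transformation lives here, fully spelled out and repeatable."""
      out = raw.copy()
      out.columns = [c.strip().lower() for c in out.columns]  # one example step
      return out

  raw = pd.read_csv(RAW_FILE)
  clean(raw).to_csv(CLEAN_FILE, index=False)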

So what do you get at the end?

Cleaned data ought to have:

  • Each variable in its own column (and no columns that aren't variables)
  • Each observation in its own row (and no rows that aren't observations)
  • Each unit of observation in its own table
  • Each observation of a given variable within explicit parameters for that variable

These explicit parameters should be documented in a separate file: the data dictionary or code book.
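
As a sketch of what those explicit parameters might look like in machine-readable form (the variable names and ranges below are invented for illustration), a data dictionary can even be used to check the clean data automatically:

  import pandas as pd

  # An invented, miniature data dictionary: each variable's meaning and allowed range.
  data_dictionary = {
      "score_pct": {"description": "Test score, percent", "min": 0.0, "max": 100.0},
      "test_year": {"description": "Calendar year of the test", "min": 2000, "max": 2030},
  }

  def out_of_range(df: pd.DataFrame, dictionary: dict) -> dict:
      """Return, per variable, the rows whose values fall outside the documented range."""
      return {
          col: df[(df[col] < rules["min"]) | (df[col] > rules["max"])].index.tolist()
          for col, rules in dictionary.items()
      }

  scores = pd.DataFrame({"score_pct": [38.0, 402.0], "test_year": [2022, 2022]})
  print(out_of_range(scores, data_dictionary))  # flags the out-of-range 402.0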

Why?

Well, it's a long story. But basically, this structure guarantees:

  • Easier computation
  • More efficient storage
  • Greater resistance to data corruption by editing

But it's just a block of cells now!

Well, yeah. But repeat to yourself this mantra:

Data storage, analysis, and display are distinct functions!

So let's review.

Or...

The summary.

Data cleaning changes raw data to clean data by means of an explicit operationalized transformation recipe and documents the parameters of the clean data in a codebook or data dictionary.

Practical examples and a short digression

So we have the theory; now let's talk about the practice.

But first, remember: One aspect of syntactic virtue is useful formatting.

And utility is relative to the kind of analysis you're doing!

Which in turn means...

Analytic clarity and clean data form a feedback loop!

The clearer you are about what questions you want to ask, the easier it is to know how your data should be cleaned.

The cleaner your data, the easier it is to answer your questions.

So let's look at an (imaginary) example.

STANDARDIZED URCHIN TESTS …2
Bob Smallchild
ADDRESSES
344 Clinton Ave. Apt 3D
TEST SCORES
Chimneysweeping - 38 - July 21
Sniveling - 402 - January 10
Chimneysw. - 66 - September 12
Emma McWaif
ADDRESSES
221B Baker Street
Down the street from the old oak, take a left
TEST SCORES
Chimneysweeping - 85% - 22/7/21
Cockney - 77% - 22/7/21

What do we notice about this dataset?

Where to even start?

  • Spreadsheet and column titles are treated as data.
  • Rows are not observations
  • Columns are not variables

And not only that...

  • Semantic and syntactic vice abound:
    • Bad values
    • Bad formatting
    • Conjoined data
    • Inconsistent entity nomenclature

But why does all that matter?

Consider the practical difficulties of analysis!

Let's assume we want to do:

  • synchronic and
  • diachronic comparisons

by

  • test type
  • geographic location
  • and student.

These things are all very difficult with the raw data as given.

This holds true whether you're working from Excel formulae or doing code-based analysis!

(If the dataset is big enough, manual analysis is prohibitive in any case.)

Let's go back and look.

So remember...

Cleaned data ought to have:

  • Each variable in its own column
  • Each observation in its own row
  • Each observation of a given variable within explicit parameters for that variable

(We will return shortly to the idea that each unit of observation has its own table)

Here's a cleaned version:

FN    LN          Address    Test       Score%  Mon  Day  Yr
Bob   Smallchild  344 Cli…   Chimney…   38.0    7    21   2022
Bob   Smallchild  344 Cli…   Chimney…   66.0    9    12   2022
Bob   Smallchild  344 Cli…   Sniveli…   40.2    1    10   2022
Emma  McWaif      221B Ba…   Chimney…   85.0    7    21   2022
Emma  McWaif      221B Ba…   Cockney    77.0    7    21   2022
Emma  McWaif      BADVAL     Chimney…   85.0    7    21   2022
Emma  McWaif      BADVAL     Cockney    77.0    7    21   2022
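
As a hedged sketch of how part of that transformation might be scripted (the parsing rules and the treatment of the out-of-bounds score are assumptions of mine, not a prescribed method), each raw score line can be split into separate variables:

  import pandas as pd

  # Three of the raw score lines, paired with the student they appeared under.
  raw_lines = [
      ("Bob", "Smallchild", "Chimneysweeping - 38 - July 21"),
      ("Bob", "Smallchild", "Sniveling - 402 - January 10"),
      ("Emma", "McWaif", "Cockney - 77% - 22/7/21"),
  ]

  rows = []
  for fn, ln, line in raw_lines:
      test, score, date = [part.strip() for part in line.split(" - ")]
      score = float(score.rstrip("%"))
      if score > 100:  # assumed rule: out-of-bounds score read as a dropped decimal (402 -> 40.2)
          score = score / 10
      # Date disambiguation ("July 21" vs "22/7/21") would be a further documented step.
      rows.append({"FN": fn, "LN": ln, "Test": test, "Score%": score, "RawDate": date})

  print(pd.DataFrame(rows))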

Some notes

Obviously I made some transformations that you'd want to back up with research.

  • Interpretation of an out-of-bounds test score
  • Treatment of bad values
  • Interpretation of ambiguous dates

Some notes

This does create visually unwieldy duplication of particular fields…

But remember our mantra: Data storage, analysis, and display are distinct functions!

But if we really want to get clean...

We can return to our old friend the unit of observation.

Units of observation are the entities picked out by each row of data.

But sometimes, they can be decomposed into smaller units!

For example

In our cleaned dataset, the unit of observation is the person-test-address-occasion!

(This is why I am not distinguishing units of observation from units of analysis: The distinction can involve ontological gerrymandering.)

But instead, we could decompose them into

Person-addresses...

FN    LN          Address
Bob   Smallchild  344 Clinton Ave. Apt 3D
Emma  McWaif      221B Baker Street
Emma  McWaif      BADVAL

And person-test-occasions!

FN    LN          Test       Score%  Mon  Day  Yr
Bob   Smallchild  Chimney…   38.0    7    21   2022
Bob   Smallchild  Chimney…   66.0    9    12   2022
Bob   Smallchild  Sniveli…   40.2    1    10   2022
Emma  McWaif      Chimney…   85.0    7    21   2022
Emma  McWaif      Cockney    77.0    7    21   2022

This greatly increases efficiency for big datasets...

But the truth is, modern computational power so far outpaces typical file size that for our work, it's mostly not necessary.
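
If you did want to decompose, a minimal sketch (using a few of the cleaned rows above) might look like this:

  import pandas as pd

  # A few rows of the cleaned table from earlier.
  cleaned = pd.DataFrame({
      "FN": ["Bob", "Emma", "Emma"],
      "LN": ["Smallchild", "McWaif", "McWaif"],
      "Address": ["344 Clinton Ave. Apt 3D", "221B Baker Street", "221B Baker Street"],
      "Test": ["Chimneysweeping", "Chimneysweeping", "Cockney"],
      "Score%": [38.0, 85.0, 77.0],
      "Mon": [7, 7, 7],
      "Day": [21, 21, 21],
      "Yr": [2022, 2022, 2022],
  })

  # One table per unit of observation.
  person_addresses = cleaned[["FN", "LN", "Address"]].drop_duplicates()
  person_tests = cleaned.drop(columns="Address").drop_duplicates()

  # A join recombines them into person-test-address-occasion rows when needed.
  recombined = person_tests.merge(person_addresses, on=["FN", "LN"])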

If you're producing data instead of cleaning it

It's worth thinking about:

  • What kind of sortability do you need? By FN/LN? By address components? Can you consolidate?
  • Avoid the need for special data types when possible, notably when handling
    • money and
    • dates (YYYYMMDD is your friend; see the sketch after this list)!
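
A quick illustration of why YYYYMMDD is your friend: stored as plain text, those dates sort chronologically with no special date type at all.

  # Dates stored as YYYYMMDD strings sort chronologically as plain text.
  yyyymmdd = ["20220405", "20220801", "20211203"]
  print(sorted(yyyymmdd))  # ['20211203', '20220405', '20220801'] -- chronological

  # The same dates written out do not: December 2021 sorts last, not first.
  wordy = ["April 5, 2022", "August 1, 2022", "December 3, 2021"]
  print(sorted(wordy))     # ['April 5, 2022', 'August 1, 2022', 'December 3, 2021']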

So now let's move from theory into reality!

The real example: clean data for GIS purposes

(Slides not reproduced here: the coast; the data; the state; the URL for this presentation; Ex. 1: Recursivity.)

That's all, folks!

Questions?