Transforming messy data, creating tidy data

PEER staff
09/04/2024

Everybody likes clean data.

And nobody likes messy data!

Sometimes it seems like most of the data we get is the latter!

Of course, “clean” and “messy” can be just words of approval or disapproval…

But calling data "messy" is often shorthand for real issues!

Maybe the data:

  • include a lot of falsehoods?
  • are difficult to interpret or analyze?
  • are not clearly documented?

All reasonable complaints!

But it would be nice to have some conceptual clarity (and practical guidance) around the issue.

So let's try that!

Here's what we're going to talk about today:

  • The theory of data cleaning, which enables
    • transforming received data and
    • producing internal data;
  • An imaginary example to illustrate the point; and
  • A real example: What is clean data for GIS purposes?

So let's start with some definitions.

Data cleaning: Checking for and imposing

  • either
    • semantic virtue (truth) or
    • syntactic virtue (formal validity)
  • within and upon raw data
  • by means of a documented, repeatable process.

This creates clean data!

Semantic virtue?

In this context, semantic virtue is just truth. It's obvious why we need this!

But I'm not going to talk about it, for a few reasons:

  • It resists theoretical reduction.
  • The process of finding it isn't particularly distinct from the rest of the job.
  • Arguably, it's not even our primary concern (though this is a long digression).

Okay... syntactic virtue, then.

In order to be useful, information doesn't just need to be true; it needs to obey certain constraints of form.

For instance, it should be:

  • readable and useful
    • A blunt needle in hand is more useful than a sharp one in a haystack! (Note that utility is relative to the analysis at hand)
  • internally consistent and uniform
    • Mixing Fahrenheit and Celsius without warning leads to confusion!
  • composed of appropriate data types and ranges
    • “True” isn't a height; “December 48th” isn't a date.

Even missing data should obey formal constraints!

If there is no answer to a question, is it because:

  • We don't know the answer? (What is Genghis Khan's birthday?)
  • The answer violates expectations of the question? (List the people who have walked on Mars.)
  • There is no answer yet? (What is my date of death?)
  • There is no answer at all? (What is the name of the King of France in 2024?)

These conditions often need to be represented differently, because they are analyzed differently!
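
For instance, here is a minimal sketch (in Python with pandas; the column names and category labels are invented for illustration) of one way to keep these conditions distinct: store the value in one column and the reason it is missing in another.

  import pandas as pd

  # Invented example: an explicit "why_missing" column keeps the four
  # conditions above distinct instead of lumping them all into one blank.
  answers = pd.DataFrame({
      "question": [
          "Genghis Khan's birthday",
          "People who have walked on Mars",
          "My date of death",
          "Name of the King of France in 2024",
      ],
      "answer": [pd.NA, pd.NA, pd.NA, pd.NA],
      "why_missing": ["unknown", "none_meet_criteria", "not_yet_determined", "no_such_entity"],
  })

  # An analysis can then treat each condition differently, e.g. keep only
  # the rows where an answer could exist but is not yet known.
  print(answers[answers["why_missing"].isin(["unknown", "not_yet_determined"])])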

Syntactic virtue, continued

These are far from the only formal constraints on data, but you get the idea!

Data needs to be syntactically virtuous in order to be usefully read, analyzed and presented.
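
As a sketch of what checking constraints like these can look like in code (Python with pandas here; the specific checks and values are just the examples above, not a prescribed procedure):

  import pandas as pd

  heights = pd.Series(["162", "175", "True"])
  dates = pd.Series(["July 21, 2022", "December 48, 2022"])

  # Coercion turns anything that is not a valid number or date into a
  # missing value, which makes the offending entries easy to flag.
  bad_heights = heights[pd.to_numeric(heights, errors="coerce").isna()]
  bad_dates = dates[pd.to_datetime(dates, errors="coerce").isna()]

  print(bad_heights)  # "True" is not a height
  print(bad_dates)    # "December 48" is not a date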

Now let's step back to the raw data.

Raw data?

The data as you receive it – you haven't manipulated, summarized, edited, or processed it in any way!

Obviously, it often violates the above formal constraints.

So it needs to be transformed. But it must be transformed reproducibly.

Non-reproducible transformation cannot be distinguished from invention!

But take note...

Transformation is not replacement!

You must preserve your raw data in their original form.

In other words…

Your raw data are important...

But don't open them up and mess around.

Data transformation

It's a good idea to push as much of your data transformation onto computer code as you can.

This is for several reasons!

  • Avoids idiosyncratic error
  • Saves time on large datasets
    • Your procedures need to optimize staff time!
  • Can be repeated in similar future cases
    • “If you have to do it twice, you'll have to do it many times, so automate it.”

Data transformation

But the most important reason, to my mind, is:

  • Functional computer code is by definition fully operationalized.

You need a recipe – workpapers that tell you all and only the fully defined operations necessary to produce your clean data from your raw data.

Working code is this recipe.
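
As a minimal sketch (the file paths and the single transformation step are placeholders, not PEER's actual procedure), such a recipe reads the raw file, never writes back to it, and produces a separate clean file:

  import pandas as pd

  RAW_FILE = "raw/received_data.csv"      # hypothetical path; read, never edited
  CLEAN_FILE = "clean/cleaned_data.csv"   # hypothetical output path

  def clean(raw: pd.DataFrame) -> pd.DataFrame:
      """Every transformation lives here, fully spelled out and repeatable."""
      out = raw.copy()
      out.columns = [c.strip().lower() for c in out.columns]  # one example step
      return out

  raw = pd.read_csv(RAW_FILE)
  clean(raw).to_csv(CLEAN_FILE, index=False)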

So what do you get at the end?

Cleaned data ought to have:

  • Each variable in its own column (and no columns that aren't variables)
  • Each observation in its own row (and no rows that aren't observations)
  • Each unit of observation in its own table
  • Each observation of a given variable within explicit parameters for that variable

These explicit parameters should be documented in a separate file: the data dictionary or code book.
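
As a sketch of what those explicit parameters might look like in machine-readable form (the variable names and ranges below are invented for illustration), a data dictionary can even be used to check the clean data automatically:

  import pandas as pd

  # An invented, miniature data dictionary: each variable's meaning and allowed range.
  data_dictionary = {
      "score_pct": {"description": "Test score, percent", "min": 0.0, "max": 100.0},
      "test_year": {"description": "Calendar year of the test", "min": 2000, "max": 2030},
  }

  def out_of_range(df: pd.DataFrame, dictionary: dict) -> dict:
      """Return, per variable, the rows whose values fall outside the documented range."""
      return {
          col: df[(df[col] < rules["min"]) | (df[col] > rules["max"])].index.tolist()
          for col, rules in dictionary.items()
      }

  scores = pd.DataFrame({"score_pct": [38.0, 402.0], "test_year": [2022, 2022]})
  print(out_of_range(scores, data_dictionary))  # flags the out-of-range 402.0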

Why?

Well, it's a long story. But basically, this structure guarantees:

  • Easier computation
  • More efficient storage
  • Greater resistance to data corruption by editing

But it's just a block of cells now!

Well, yeah. But repeat to yourself this mantra:

Data storage, analysis, and display are distinct functions!

So let's review.

Or...

The summary.

Data cleaning changes raw data to clean data by means of an explicit operationalized transformation recipe and documents the parameters of the clean data in a codebook or data dictionary.

Practical examples and a short digression

So we have the theory; now let's talk about the practice.

But first, remember: One aspect of syntactic virtue is useful formatting.

And utility is relative to the kind of analysis you're doing!

Which in turn means...

Analytic clarity and clean data form a feedback loop!

The clearer you are about what questions you want to ask, the easier it is to know how your data should be cleaned.

The cleaner your data, the easier it is to answer your questions.

So let's look at an (imaginary) example.

STANDARDIZED URCHIN TESTS …2
Bob Smallchild
ADDRESSES
344 Clinton Ave. Apt 3D
TEST SCORES
Chimneysweeping - 38 - July 21
Sniveling - 402 - January 10
Chimneysw. - 66 - September 12
Emma McWaif
ADDRESSES
221B Baker Street
Down the street from the old oak, take a left
TEST SCORES
Chimneysweeping - 85% - 22/7/21
Cockney - 77% - 22/7/21

What do we notice about this dataset?

Where to even start?

  • Spreadsheet and column titles are treated as data.
  • Rows are not observations
  • Columns are not variables

And not only that...

  • Semantic and syntactic vice abound:
    • Bad values
    • Bad formatting
    • Conjoined data
    • Inconsistent entity nomenclature

But why does all that matter?

Consider the practical difficulties of analysis!

Let's assume we want to do:

  • synchronic and
  • diachronic comparisons

by

  • test type
  • geographic location
  • and student.

These things are all very difficult with the raw data as given.

This holds true whether you're working from Excel formulae or doing code-based analysis!

(If the dataset is big enough, manual analysis is prohibitive in any case.)

Let's go back and look.

So remember...

Cleaned data ought to have:

  • Each variable in its own column
  • Each observation in its own row
  • Each observation of a given variable within explicit parameters for that variable

(We will return shortly to the idea that each unit of observation has its own table)

Here's a cleaned version:

FN    LN          Address    Test       Score%  Mon  Day  Yr
Bob   Smallchild  344 Cli…   Chimney…   38.0    7    21   2022
Bob   Smallchild  344 Cli…   Chimney…   66.0    9    12   2022
Bob   Smallchild  344 Cli…   Sniveli…   40.2    1    10   2022
Emma  McWaif      221B Ba…   Chimney…   85.0    7    21   2022
Emma  McWaif      221B Ba…   Cockney    77.0    7    21   2022
Emma  McWaif      BADVAL     Chimney…   85.0    7    21   2022
Emma  McWaif      BADVAL     Cockney    77.0    7    21   2022
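
As a hedged sketch of how part of that transformation might be scripted (the parsing rules and the treatment of the out-of-bounds score are assumptions of mine, not a prescribed method), each raw score line can be split into separate variables:

  import pandas as pd

  # Three of the raw score lines, paired with the student they appeared under.
  raw_lines = [
      ("Bob", "Smallchild", "Chimneysweeping - 38 - July 21"),
      ("Bob", "Smallchild", "Sniveling - 402 - January 10"),
      ("Emma", "McWaif", "Cockney - 77% - 22/7/21"),
  ]

  rows = []
  for fn, ln, line in raw_lines:
      test, score, date = [part.strip() for part in line.split(" - ")]
      score = float(score.rstrip("%"))
      if score > 100:  # assumed rule: out-of-bounds score read as a dropped decimal (402 -> 40.2)
          score = score / 10
      # Date disambiguation ("July 21" vs "22/7/21") would be a further documented step.
      rows.append({"FN": fn, "LN": ln, "Test": test, "Score%": score, "RawDate": date})

  print(pd.DataFrame(rows))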

Some notes

Obviously I made some transformations that you'd want to back up with research.

  • Interpretation of an out-of-bounds test score
  • Treatment of bad values
  • Interpretation of ambiguous dates

Some notes

This does create visually unwieldy duplication of particular fields…

But remember our mantra: Data storage, analysis, and display are distinct functions!

But if we really want to get clean...

We can return to our old friend the unit of observation.

Units of observation are the entities picked out by each row of data.

But sometimes, they can be decomposed into smaller units!

For example

In our cleaned dataset, the unit of observation is the person-test-address-occasion!

(This is why I am not distinguishing units of observation from units of analysis: The distinction can involve ontological gerrymandering.)

But instead, we could decompose them into

Person-addresses...

FN    LN          Address
Bob   Smallchild  344 Clinton Ave. Apt 3D
Emma  McWaif      221B Baker Street
Emma  McWaif      BADVAL

And person-test-occasions!

FN    LN          Test       Score%  Mon  Day  Yr
Bob   Smallchild  Chimney…   38.0    7    21   2022
Bob   Smallchild  Chimney…   66.0    9    12   2022
Bob   Smallchild  Sniveli…   40.2    1    10   2022
Emma  McWaif      Chimney…   85.0    7    21   2022
Emma  McWaif      Cockney    77.0    7    21   2022

This greatly increases efficiency for big datasets...

But the truth is, modern computational power so far outpaces typical file size that for our work, it's mostly not necessary.
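
If you did want to decompose, a minimal sketch (using a few of the cleaned rows above) might look like this:

  import pandas as pd

  # A few rows of the cleaned table from earlier.
  cleaned = pd.DataFrame({
      "FN": ["Bob", "Emma", "Emma"],
      "LN": ["Smallchild", "McWaif", "McWaif"],
      "Address": ["344 Clinton Ave. Apt 3D", "221B Baker Street", "221B Baker Street"],
      "Test": ["Chimneysweeping", "Chimneysweeping", "Cockney"],
      "Score%": [38.0, 85.0, 77.0],
      "Mon": [7, 7, 7],
      "Day": [21, 21, 21],
      "Yr": [2022, 2022, 2022],
  })

  # One table per unit of observation.
  person_addresses = cleaned[["FN", "LN", "Address"]].drop_duplicates()
  person_tests = cleaned.drop(columns="Address").drop_duplicates()

  # A join recombines them into person-test-address-occasion rows when needed.
  recombined = person_tests.merge(person_addresses, on=["FN", "LN"])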

If you're producing data instead of cleaning it

It's worth thinking about:

  • What kind of sortability do you need? By FN/LN? By address components? Can you consolidate?
  • Avoid the need for special data types when possible, notably when handling
    • money and
    • dates (YYYYMMDD is your friend; see the sketch after this list)!
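
A quick illustration of why YYYYMMDD is your friend: stored as plain text, those dates sort chronologically with no special date type at all.

  # Dates stored as YYYYMMDD strings sort chronologically as plain text.
  yyyymmdd = ["20220405", "20220801", "20211203"]
  print(sorted(yyyymmdd))  # ['20211203', '20220405', '20220801'] -- chronological

  # The same dates written out do not: December 2021 sorts last, not first.
  wordy = ["April 5, 2022", "August 1, 2022", "December 3, 2021"]
  print(sorted(wordy))     # ['April 5, 2022', 'August 1, 2022', 'December 3, 2021']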

So now let's move from theory into reality!

The real example: clean data for GIS purposes

(Slides not reproduced here: the coast; the data; the state; the URL for this presentation; Ex. 1: Recursivity.)

That's all, folks!

Questions?