PEER staff
09/04/2024
And nobody likes messy data!
Sometimes it seems like most of the data we get is the latter!
These can be just words of approval or disapproval…
Maybe the data:
All reasonable complaints!
But it would be nice to have some conceptual clarity (and practical guidance) around the issue.
Here's what we're going to talk about today:
Data cleaning: Checking for and imposing semantic and syntactic virtue
This creates clean data!
In this context, semantic virtue is just truth. It's obvious why we need this!
But I'm not going to talk about it, for a few reasons:
In order to be useful, information doesn't just need to be true; it needs to obey certain constraints of form.
For instance, it should be:
If there is no answer to a question, is it because:
These conditions often need to be represented differently, because they are analyzed differently!
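As a quick sketch of what that can look like in practice (the reasons here are common examples, not necessarily the ones on the slide): a single blank cell can't distinguish "declined to answer" from "not applicable," but a separate field recording the reason can.

```python
import pandas as pd

# Hypothetical example: a blank answer can hide different situations.
# Keeping the *reason* for the missing answer in its own column means
# "declined to answer" and "not applicable" aren't silently merged.
responses = pd.DataFrame({
    "respondent": ["A", "B", "C"],
    "answer": [4.0, None, None],
    "missing_reason": [None, "declined", "not_applicable"],
})

# Each analysis can then treat the cases differently, e.g. drop
# "not_applicable" rows but count "declined" as nonresponse.
nonresponse_rate = (responses["missing_reason"] == "declined").mean()
print(nonresponse_rate)
```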
These are far from the only formal constraints on data, but you get the idea!
Data needs to be syntactically virtuous in order to be usefully read, analyzed and presented.
Now let's step back to the raw data.
The data as you receive it – you haven't manipulated, summarized, edited, or processed it in any way!
Obviously, it often violates the above formal constraints.
So it needs to be transformed. But it must be transformed reproducibly.
Non-reproducible transformation cannot be distinguished from invention!
Transformation is not replacement!
You must preserve your raw data in their original form.
In other words…
It's a good idea to push as much of your data transformation onto computer code as you can.
This is for several reasons!
But the most important reason, to my mind, is:
You need a recipe – workpapers that tell you all and only the fully defined operations necessary to produce your clean data from your raw data.
Working code is this recipe.
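To make that concrete, here's a minimal sketch of such a recipe in Python with pandas; the file paths, column names, and cleaning steps are placeholders, not part of the original materials. The point is the shape: read the raw file, apply every transformation explicitly in code, and write a separate clean file, leaving the raw data untouched.

```python
import pandas as pd

RAW_PATH = "raw/test_scores_raw.csv"        # hypothetical raw export: never edited by hand
CLEAN_PATH = "clean/test_scores_clean.csv"  # the cleaned copy lives in a separate file

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Every raw-to-clean transformation lives here, in one place."""
    df = raw.copy()
    df.columns = [c.strip().lower() for c in df.columns]               # standardize column names
    df["score_pct"] = pd.to_numeric(df["score_pct"], errors="coerce")  # force scores to be numeric
    df = df.dropna(subset=["score_pct"])                               # any dropped rows are a documented choice
    return df

if __name__ == "__main__":
    raw = pd.read_csv(RAW_PATH)
    clean(raw).to_csv(CLEAN_PATH, index=False)  # the raw file is never overwritten
```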
Cleaned data ought to have:
These explicit parameters should be documented in a separate file: the data dictionary or code book.
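A codebook doesn't need to be fancy. One row per field in the clean file, along the lines of the (invented) example below, already does the job:

| Field | Type | Allowed values | Meaning |
|---|---|---|---|
| last_name | text | any | Test-taker's last name |
| score_pct | numeric | 0 to 100 | Test score as a percentage |
| test_date | date | YYYY-MM-DD | Date the test was administered |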
Well, it's a long story. But basically, this structure guarantees:
Well, yeah. But repeat to yourself this mantra:
Data storage, analysis, and display are distinct functions!
So let's review.
Data cleaning changes raw data to clean data by means of an explicit operationalized transformation recipe and documents the parameters of the clean data in a codebook or data dictionary.
So we have the theory; now let's talk about the practice.
But first, remember: One aspect of syntactic virtue is useful formatting.
And utility is relative to the kind of analysis you're doing!
Analytic clarity and clean data form a feedback loop!
The clearer you are about what questions you want to ask, the easier it is to know how your data should be cleaned.
The cleaner your data, the easier it is to answer your questions.
| STANDARDIZED URCHIN TESTS |
|---|
| Bob Smallchild |
| ADDRESSES |
| 344 Clinton Ave. Apt 3D |
| TEST SCORES |
| Chimneysweeping - 38 - July 21 |
| Sniveling - 402 - January 10 |
| Chimneysw. - 66 - September 12 |
| Emma McWaif |
| ADDRESSES |
| 221B Baker Street |
| Down the street from the old oak, take a left |
| TEST SCORES |
| Chimneysweeping - 85% - 22/7/21 |
| Cockney - 77% - 22/7/21 |
Where to even start?
But why does all that matter?
Let's assume we want to do:
by
This holds true whether you're working from Excel formulae or doing code-based analysis!
(If the dataset is big enough, manual analysis is prohibitive in any case.)
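For instance, suppose (my assumption, not the slide's) that we want the average score for each test type. On a clean, well-formatted table that's a one-liner:

```python
import pandas as pd

# Hypothetical clean file with one row per person-test-occasion,
# matching the cleaned table shown below.
scores = pd.read_csv("clean/test_scores_clean.csv")

# Average score by test type: one line on clean data,
# a headache on the raw sheet above.
print(scores.groupby("Test")["Score%"].mean())
```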
Let's go back and look.
Cleaned data ought to have:
(We will return shortly to the idea that each unit of observation has its own table)
| FN | LN | Address | Test | Score% | Mon | Day | Yr |
|---|---|---|---|---|---|---|---|
| Bob | Smallchild | 344 Cli… | Chimney… | 38.0 | 7 | 21 | 2022 |
| Bob | Smallchild | 344 Cli… | Chimney… | 66.0 | 9 | 12 | 2022 |
| Bob | Smallchild | 344 Cli… | Sniveli… | 40.2 | 1 | 10 | 2022 |
| Emma | McWaif | 221B Ba… | Chimney… | 85.0 | 7 | 21 | 2022 |
| Emma | McWaif | 221B Ba… | Cockney | 77.0 | 7 | 21 | 2022 |
| Emma | McWaif | BADVAL | Chimney… | 85.0 | 7 | 21 | 2022 |
| Emma | McWaif | BADVAL | Cockney | 77.0 | 7 | 21 | 2022 |
Obviously I made some transformations that you'd want to back up with research.
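As a rough sketch of how pieces of that transformation could live in code rather than be done by hand (this is illustrative only; the judgment calls, like what to do with that "402" or which year to assign, still need to be researched and documented):

```python
# One raw line per test, e.g. "Chimneysweeping - 38 - July 21" or
# "Cockney - 77% - 22/7/21".  This parser is illustrative only; the real
# recipe would spell out (and justify) every rule it applies.
def parse_test_line(line: str) -> dict:
    test, score, date = [part.strip() for part in line.split(" - ")]
    return {
        "test": "Chimneysweeping" if test.startswith("Chimneysw") else test,  # expand the abbreviation
        "score_pct": float(score.rstrip("%")),  # strip "%" so the field is purely numeric
        "date_raw": date,                       # date formats differ; normalizing them is a later, documented step
    }

print(parse_test_line("Chimneysw. - 66 - September 12"))
print(parse_test_line("Cockney - 77% - 22/7/21"))
```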
This does create visually unwieldy duplication of particular fields…
But remember our mantra: Data storage, analysis, and display are distinct functions!
We can return to our old friend the unit of observation.
Units of observation are the entities picked out by each row of data.
But sometimes, they can be decomposed into smaller units!
In our cleaned dataset, the unit of observation is the person-test-address-occasion!
(This is why I am not distinguishing units of observation from units of analysis: The distinction can involve ontological gerrymandering.)
But instead, we could decompose them into two smaller tables: one of person-addresses and one of person-test scores.
| FN | LN | Address |
|---|---|---|
| Bob | Smallchild | 344 Clinton Ave. Apt 3D |
| Emma | McWaif | 221B Baker Street |
| Emma | McWaif | BADVAL |

| FN | LN | Test | Score% | Mon | Day | Yr |
|---|---|---|---|---|---|---|
| Bob | Smallchild | Chimney… | 38.0 | 7 | 21 | 2022 |
| Bob | Smallchild | Chimney… | 66.0 | 9 | 12 | 2022 |
| Bob | Smallchild | Sniveli… | 40.2 | 1 | 10 | 2022 |
| Emma | McWaif | Chimney… | 85.0 | 7 | 21 | 2022 |
| Emma | McWaif | Cockney | 77.0 | 7 | 21 | 2022 |
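When an analysis needs fields from both tables, a join puts them back together on demand. A minimal sketch, assuming the two tables above are saved as separate (hypothetical) files:

```python
import pandas as pd

# Hypothetical files holding the two decomposed tables above.
people = pd.read_csv("clean/people_addresses.csv")  # FN, LN, Address
scores = pd.read_csv("clean/test_scores.csv")       # FN, LN, Test, Score%, Mon, Day, Yr

# Recombine only when an analysis needs both: each score paired with the
# person's address(es).
combined = scores.merge(people, on=["FN", "LN"], how="left")
print(combined.head())
```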
But the truth is, modern computational power so far outpaces our typical file sizes that, for our work, this kind of decomposition is mostly not necessary.
It's worth thinking:
So now let's move from theory into reality!
Questions?