“Happy ~~families~~ *datasets* are all alike; every unhappy ~~family~~ *dataset* is unhappy in its own way.”
-Leo Tolstoy (Anna Karenina), my edits in italics
Like Tolstoy implied (sort of), messy data can take one of infinite forms. As a warm-up, a few common ones: inconsistently named columns, empty rows and columns, duplicated entries, and values that fall outside the range you’d expect.
Check out “The Quartz guide to bad data” for an admirably comprehensive list of forms that messy data can take.
“How do I ~~love~~ *clean* thee? Let me count the ways.”
-Elizabeth Barrett Browning (Sonnet 43), my edits in italics
There are already wonderfully written, comprehensive guides about cleaning and managing data. My goal here, in contrast, is to offer a few ideas I find especially important when it comes to data cleaning.
Ask yourself:
Who collected this data?
What do the rows and columns represent?
What do we expect to see vs. what is present?
Before you go any further, save a copy of the original data and do not touch it. I like to save the original data in a folder called “data_raw”. Once I’m done cleaning the data, I’ll save the cleaned version in a folder called “data_clean”, then read in that data to be analyzed downstream.
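For example, the workflow might look something like this (a minimal sketch; the file names are placeholders, and read.csv()/write.csv() can be swapped for whichever import/export functions you prefer):
# read the untouched original from data_raw/ and never overwrite it
raw_data <- read.csv("data_raw/survey_raw.csv")
# ...cleaning steps go here...
clean_data <- raw_data
# write the cleaned version to data_clean/, then analyze that copy downstream
write.csv(clean_data, "data_clean/survey_clean.csv", row.names = FALSE)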
When Jenny Bryan, R engineer and instructor extraordinaire, says something, it’s a good idea to listen.
She explains how to name files well and why it’s worthwhile to do so in her 2022 NormConf flash talk. While filenames and column names aren’t identical, her talk is very relevant to this section about naming columns well. Especially this slide:
Alt text: A slide with bolded text: Pick a filename/ any convention/ Just pick one. There are two feet in the corner, each with an arrow above it pointing a different direction.
Sometimes you have to clean data that contains columns with naming conventions as unique as snowflakes. Let’s make a chaotic dataframe with data about the characters in Little Women and then clean it with the wonderful janitor package.
install.package("janitor")
library(janitor)
library(magrittr) # load this to use pipes like %>%, %<>%, and %T>%
little_women <- data.frame(
# Characters' names
characterNames = c("Jo", "Beth", "Amy", "Meg", "Timothee"),
# Characters' ages
áge = c(15, 13, 12, 16, 84),
# Characters' ages in dog years
'Age (Dog years)' = c(105, 91, 83, 112, 588),
# Personality traits
PERSONALITYTrait = c("impatient", "musical", "artistic", "responsible", NA),
# An empty column like those commonly imported from Excel
emptyROW = c(NA, NA, NA, NA, NA)
)
head(little_women)
## characterNames áge Age..Dog.years. PERSONALITYTrait emptyROW
## 1 Jo 15 105 impatient NA
## 2 Beth 13 91 musical NA
## 3 Amy 12 83 artistic NA
## 4 Meg 16 112 responsible NA
## 5 Timothee 84 588 <NA> NA
Now, clean those unruly column names with janitor::clean_names():
little_women %<>%
janitor::clean_names() %T>%
print()
## character_names age age_dog_years personality_trait empty_row
## 1 Jo 15 105 impatient NA
## 2 Beth 13 91 musical NA
## 3 Amy 12 83 artistic NA
## 4 Meg 16 112 responsible NA
## 5 Timothee 84 588 <NA> NA
Now, our columns are easily read by both humans and machines. They are also named consistently, with a lowercase & snake_case convention (i.e., words are separated with underscores, _).
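By the way, if your team has already settled on a different convention, clean_names() has a case argument (passed along to the snakecase package) that can, as far as I know, produce other styles. A quick sketch:
janitor::clean_names(little_women, case = "snake")       # the default
janitor::clean_names(little_women, case = "lower_camel") # e.g., characterNames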
Let’s use another janitor function, remove_empty(), to automatically remove columns and rows that are empty. I often encounter empty rows/columns when importing csvs from Excel.
little_women %<>%
janitor::remove_empty(c("rows", "cols")) %T>%
print()
## character_names age age_dog_years personality_trait
## 1 Jo 15 105 impatient
## 2 Beth 13 91 musical
## 3 Amy 12 83 artistic
## 4 Meg 16 112 responsible
## 5 Timothee 84 588 <NA>
One more function worth exploring from janitor is get_dupes(). This will alert us about any duplicate rows within our dataset.
First, duplicate each row:
little_women %<>%
dplyr::slice(rep(1:5, each = 2))
Then, ask get_dupes() to find them:
janitor::get_dupes(little_women)
## No variable names specified - using all columns.
## character_names age age_dog_years personality_trait dupe_count
## 1 Amy 12 83 artistic 2
## 2 Amy 12 83 artistic 2
## 3 Beth 13 91 musical 2
## 4 Beth 13 91 musical 2
## 5 Jo 15 105 impatient 2
## 6 Jo 15 105 impatient 2
## 7 Meg 16 112 responsible 2
## 8 Meg 16 112 responsible 2
## 9 Timothee 84 588 <NA> 2
## 10 Timothee 84 588 <NA> 2
Finally, keep only unique rows:
little_women %<>%
unique() %T>%
print()
## character_names age age_dog_years personality_trait
## 1 Jo 15 105 impatient
## 3 Beth 13 91 musical
## 5 Amy 12 83 artistic
## 7 Meg 16 112 responsible
## 9 Timothee 84 588 <NA>
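As an aside, dplyr::distinct() is a common alternative for the same job; a quick sketch:
little_women %>%
  dplyr::distinct()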
Validating your data means, for example, making sure that your “Year” column contains values between 1900 and 2000, or confirming that “Distance_meters” = “Distance_feet” x 0.3048.
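As a rough base-R sketch of those two checks (assuming a hypothetical data frame df with columns Year, Distance_meters, and Distance_feet):
stopifnot(
  all(df$Year >= 1900 & df$Year <= 2000, na.rm = TRUE),
  all(abs(df$Distance_meters - df$Distance_feet * 0.3048) < 1e-6, na.rm = TRUE)
)
Dedicated packages make this kind of checking more systematic and readable.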
Let’s use the validate package.
First, install and load the package:
install.package("validate")
library(validate)
Next, validate that the dataset is set up properly:
Their names and traits should be of class “character”. They’re little women, so their ages should be below… let’s say 20. Their ages in dog years should be equal to their age x 7.
checks <- validator(
  class(character_names) == "character",
  age < 20,
  age_dog_years == age * 7,
  class(personality_trait) == "character"
)
# perform the validations
checked_lw <- validate::confront(little_women, checks)
# see a summary of the results
summary(checked_lw)
## name items passes fails nNA error warning
## 1 V1 1 1 0 0 FALSE FALSE
## 2 V2 5 4 1 0 FALSE FALSE
## 3 V3 5 4 1 0 FALSE FALSE
## 4 V4 1 1 0 0 FALSE FALSE
## expression
## 1 class(character_names) == "character"
## 2 age < 20
## 3 abs(age_dog_years - age * 7) <= 1e-08
## 4 class(personality_trait) == "character"
Here’s what our summary says: our first and fourth rules went 1 for 1, i.e., the character_names and personality_trait columns really are of class “character” (good!). For rules 2 and 3, one item failed each test. Indeed, Timothee’s age is not < 20, and Amy’s age in dog years isn’t exactly her age x 7; it’s slightly off.
We can even plot these results! Nifty, eh?
plot(checked_lw)
We can also identify the entries that broke the rules.
Option 1: see entire data frame
as.data.frame(checked_lw) %>%
# exclude entries 1 and 4 because they are spotless
dplyr::filter(name != "V1" & name != "V4")
## name value expression
## 11 V2 TRUE age < 20
## 3 V2 TRUE age < 20
## 5 V2 TRUE age < 20
## 7 V2 TRUE age < 20
## 9 V2 FALSE age < 20
## 12 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
## 31 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
## 51 V3 FALSE abs(age_dog_years - age * 7) <= 1e-08
## 71 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
## 91 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
Then, visually assess based on the “value” column’s TRUE/FALSE entries.
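Rather than scanning by eye, you could also filter down to just the failed checks; a quick sketch:
as.data.frame(checked_lw) %>%
  # keep only the checks that evaluated to FALSE
  dplyr::filter(value == FALSE)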
Option 2: Let validate show us the “violating” rows
validate::violating(little_women,    # original df
                    checked_lw[2:3]) # evaluated object, subset to the rules with violations that can be assessed per row
## character_names age age_dog_years personality_trait
## 5 Amy 12 83 artistic
## 9 Timothee 84 588 <NA>
Rules 2 and 3 were broken. If Rule 1 had been violated, this step wouldn’t have been possible (or necessary), because a variable’s class is something you check once; you don’t assess it on a row-by-row basis.
Check out other types of validations that Meghan Harris covered in her excellent presentation at RStudio::conf(2022):
Alt text: A slide with the title, “Examples of Validations Types I’ve personally used”. There are eight circles with different data validations: Variable type, allowed/expected values, missing data, logic checks, uniqueness, string length, formatting, range checks.
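For a taste of what a few of those other check types might look like as validate rules (a hypothetical sketch; we won’t run these against little_women here):
more_checks <- validator(
  !is.na(personality_trait),                            # missing data
  personality_trait %in% c("impatient", "musical",
                           "artistic", "responsible"),  # allowed/expected values
  nchar(character_names) <= 30,                         # string length
  age >= 0                                              # range check
)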
Let’s correct the values we flagged with validate (and, while we’re at it, fill in Timothee’s missing personality trait), then look at a simple summary of the data.
little_women[5, "personality_trait"] <- "angsty"
little_women[5, "age"] <- 15
little_women[5, "age_dog_years"] <- 15*7
little_women[3, "age_dog_years"] <- 12*7
summary(little_women)
## character_names age age_dog_years personality_trait
## Length:5 Min. :12.0 Min. : 84.0 Length:5
## Class :character 1st Qu.:13.0 1st Qu.: 91.0 Class :character
## Mode :character Median :15.0 Median :105.0 Mode :character
## Mean :14.2 Mean : 99.4
## 3rd Qu.:15.0 3rd Qu.:105.0
## Max. :16.0 Max. :112.0
The character columns’ summaries report each column’s length, class, and mode. For the numeric columns, we get quantitative summaries: minimums and maximums, medians and means, and the 1st and 3rd quartiles. This is another way to find possible errors in the data.
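One small trick for the character columns: summarize them as factors to get a count per unique value, which makes typos and unexpected categories easier to spot. A quick sketch:
summary(as.factor(little_women$personality_trait))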
Another easy way to get a sense of your data is to plot it.
Let’s use a much richer dataset from the gapminder package. I guess it deviates from this tutorial’s literature theme, but hey, some of the great novels were written during the years included in this dataset!
install.package("gapminder")
library(gapminder)
We’ll save a local copy of the dataset, called “gap_data”, then explore it.
gap_data <- gapminder
str(gap_data)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
head(gap_data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
This dataset contains information about the life expectancy, population, and GDP per capita for many countries over many years. How many countries? How many years?
First: how many countries?
length(unique(gap_data$country))
## [1] 142
How many continents?
length(unique(gap_data$continent))
## [1] 5
During which years?
range(gap_data$year)
## [1] 1952 2007
But only 12 years:
length(unique(gap_data$year))
## [1] 12
Specifically, these years. Two per decade, it appears:
unique(gap_data$year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
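We can confirm that spacing directly; the gaps between consecutive survey years should all be 5:
diff(sort(unique(gap_data$year))) # should print eleven 5s, i.e., one survey every five years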
Let’s explore the data visually. We can look for outliers, trends, and more to familiarize ourselves with the data. We’ll start by using the base R scatterplot matrix function, pairs(), to plot all six variables in the dataset against one another.
gap_data %>%
pairs()
We can start to see some trends here; for example, year and life expectancy seem to have a positive relationship. However, there’s a lot of data, and each country has many entries (12 years’ worth each), so let’s filter a little, then replot.
We’ll keep only Bolivia’s rows and exclude the country and continent variables. We have to be careful with those two because they’re categorical, not numeric like the remaining variables.
gap_data %>%
dplyr::filter(country == "Bolivia") %>%
dplyr::select(year, lifeExp, pop, gdpPercap) %>%
pairs()
Very nice. Now we can see some trends more clearly, like that year, life expectancy, and population have pretty clear positive relationships. GDP per capita isn’t as tightly correlated with these factors (the R-squared values are likely lower), but it, too, appears positively related to the others.
If you had entered this data by hand, this plotting exercise could help point out that you’d entered 3000 instead of 2000 for year or had missed a few entries, for instance. It also informs you of the patterns you should expect, developing your instincts to spot errors before you get in too deep.
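You could even encode that expectation as a validate rule, so a typo like 3000 would get flagged automatically; a small sketch reusing the approach from earlier:
year_checks <- validator(year >= 1952, year <= 2007)
summary(validate::confront(gap_data, year_checks))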
These are some of the key ideas I think about while cleaning data. What goes through your mind when you receive a new dataset? Do you have any favorite tips or tricks for familiarizing yourself with or cleaning the data? Feel free to let me know! There’s always something new to learn!