“Happy ~~families~~ *datasets* are all alike; every unhappy ~~family~~ *dataset* is unhappy in its own way.”
-Leo Tolstoy (Anna Karenina), my edits in italics
Like Tolstoy implied (sort of), messy data can take one of infinite forms. As a warm-up, a few common ones: inconsistently named columns, empty rows and columns, duplicated entries, and values that fall outside the range you’d expect.
Check out “The Quartz guide to bad data” for an admirably comprehensive list of forms that messy data can take.
“How do I ~~love~~ *clean* thee? Let me count the ways.”
-Elizabeth Barrett Browning (Sonnet 43), my edits in italics
There are already wonderfully written, comprehensive guides about cleaning and managing data. My goal here, in contrast, is to offer a few ideas I find especially important when it comes to data cleaning.
Ask yourself:
Who collected this data?
What do the rows and columns represent?
What do we expect to see vs. what is present?
Before you go any further, save a copy of the original data and do not touch it. I like to save the original data in a folder called “data_raw”. Once I’m done cleaning the data, I’ll save the cleaned version in a folder called “data_clean”, then read in that data to be analyzed downstream.
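For example, the workflow might look something like this (a minimal sketch; the file names are placeholders, and read.csv()/write.csv() can be swapped for whichever import/export functions you prefer):
# read the untouched original from data_raw/ and never overwrite it
raw_data <- read.csv("data_raw/survey_raw.csv")
# ...cleaning steps go here...
clean_data <- raw_data
# write the cleaned version to data_clean/, then analyze that copy downstream
write.csv(clean_data, "data_clean/survey_clean.csv", row.names = FALSE)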
When Jenny Bryan, R engineer and instructor extraordinaire, says something, it’s a good idea to listen.
She explains how to name files well and why it’s worthwhile to do so in her 2022 NormConf flash talk. While filenames and column names aren’t identical, her talk is very relevant to this section about naming columns well. Especially this slide:
Alt text: A slide with bolded text: Pick a filename/ any convention/ Just pick one. There are two feet in the corner, each with an arrow above it pointing a different direction.
Sometimes you have to clean data that contains columns with naming conventions as unique as snowflakes. Let’s make a chaotic dataframe with data about the characters in Little Women and then clean it with the wonderful janitor package.
install.package("janitor")
library(janitor)
library(magrittr) # load this to use pipes like %>%, %<>%, and %T>%
little_women <- data.frame(
# Characters' names
characterNames = c("Jo", "Beth", "Amy", "Meg", "Timothee"),
# Characters' ages
áge = c(15, 13, 12, 16, 84),
# Characters' ages in dog years
'Age (Dog years)' = c(105, 91, 83, 112, 588),
# Personality traits
PERSONALITYTrait = c("impatient", "musical", "artistic", "responsible", NA),
# An empty column like those commonly imported from Excel
emptyROW = c(NA, NA, NA, NA, NA)
)
head(little_women)
## characterNames áge Age..Dog.years. PERSONALITYTrait emptyROW
## 1 Jo 15 105 impatient NA
## 2 Beth 13 91 musical NA
## 3 Amy 12 83 artistic NA
## 4 Meg 16 112 responsible NA
## 5 Timothee 84 588 <NA> NA
Now, clean those unruly column names with janitor::clean_names():
little_women %<>%
janitor::clean_names() %T>%
print()
## character_names age age_dog_years personality_trait empty_row
## 1 Jo 15 105 impatient NA
## 2 Beth 13 91 musical NA
## 3 Amy 12 83 artistic NA
## 4 Meg 16 112 responsible NA
## 5 Timothee 84 588 <NA> NA
Now, our columns are easily read by both humans and machines. They are also named consistently, with a lowercase & snake_case convention (i.e., words are separated with underscores, _).
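By the way, if your team has already settled on a different convention, clean_names() has a case argument (passed along to the snakecase package) that can, as far as I know, produce other styles. A quick sketch:
janitor::clean_names(little_women, case = "snake")       # the default
janitor::clean_names(little_women, case = "lower_camel") # e.g., characterNames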
Let’s use another janitor function, remove_empty(), to automatically remove columns and rows that are empty. I often encounter empty rows/columns when importing csvs from Excel.
little_women %<>%
janitor::remove_empty(c("rows", "cols")) %T>%
print()
## character_names age age_dog_years personality_trait
## 1 Jo 15 105 impatient
## 2 Beth 13 91 musical
## 3 Amy 12 83 artistic
## 4 Meg 16 112 responsible
## 5 Timothee 84 588 <NA>
One more function worth exploring from janitor is get_dupes(). This will alert us about any duplicate rows within our dataset.
First, duplicate each row:
little_women %<>%
dplyr::slice(rep(1:5, each = 2))
Then, ask get_dupes() to find them:
janitor::get_dupes(little_women)
## No variable names specified - using all columns.
## character_names age age_dog_years personality_trait dupe_count
## 1 Amy 12 83 artistic 2
## 2 Amy 12 83 artistic 2
## 3 Beth 13 91 musical 2
## 4 Beth 13 91 musical 2
## 5 Jo 15 105 impatient 2
## 6 Jo 15 105 impatient 2
## 7 Meg 16 112 responsible 2
## 8 Meg 16 112 responsible 2
## 9 Timothee 84 588 <NA> 2
## 10 Timothee 84 588 <NA> 2
Finally, keep only unique rows:
little_women %<>%
unique() %T>%
print()
## character_names age age_dog_years personality_trait
## 1 Jo 15 105 impatient
## 3 Beth 13 91 musical
## 5 Amy 12 83 artistic
## 7 Meg 16 112 responsible
## 9 Timothee 84 588 <NA>
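As an aside, dplyr::distinct() is a common alternative for the same job; a quick sketch:
little_women %>%
  dplyr::distinct()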
Validating your data means, for example, making sure that your “Year” column contains values between 1900 and 2000, or confirming that “Distance_meters” = “Distance_feet” x 0.3048.
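As a rough base-R sketch of those two checks (assuming a hypothetical data frame df with columns Year, Distance_meters, and Distance_feet):
stopifnot(
  all(df$Year >= 1900 & df$Year <= 2000, na.rm = TRUE),
  all(abs(df$Distance_meters - df$Distance_feet * 0.3048) < 1e-6, na.rm = TRUE)
)
Dedicated packages make this kind of checking more systematic and readable.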
Let’s use the validate package.
First, install and load the package:
install.package("validate")
library(validate)
Next, validate that the dataset is set up properly:
Their names and traits should be of class “character”. They’re little women, so their ages should be below… let’s say 20. Their ages in dog years should be equal to their age x 7.
checks <- validator(
  class(character_names) == "character",
  age < 20,
  age_dog_years == age * 7,
  class(personality_trait) == "character"
)
# perform the validations
checked_lw <- validate::confront(little_women, checks)
# see a summary of the results
summary(checked_lw)
## name items passes fails nNA error warning
## 1 V1 1 1 0 0 FALSE FALSE
## 2 V2 5 4 1 0 FALSE FALSE
## 3 V3 5 4 1 0 FALSE FALSE
## 4 V4 1 1 0 0 FALSE FALSE
## expression
## 1 class(character_names) == "character"
## 2 age < 20
## 3 abs(age_dog_years - age * 7) <= 1e-08
## 4 class(personality_trait) == "character"
Here’s what our summary says: our first and fourth rules went 1 for 1, i.e., the character_names and personality_trait columns really are of class “character” (good!). For rules 2 and 3, one item failed each test. Indeed, Timothee’s age is not < 20, and Amy’s age in dog years isn’t exactly her age x 7; it’s slightly off.
We can even plot these results! Nifty, eh?
plot(checked_lw)
We can also identify the entries that broke the rules.
Option 1: see entire data frame
as.data.frame(checked_lw) %>%
# exclude entries 1 and 4 because they are spotless
dplyr::filter(name != "V1" & name != "V4")
## name value expression
## 11 V2 TRUE age < 20
## 3 V2 TRUE age < 20
## 5 V2 TRUE age < 20
## 7 V2 TRUE age < 20
## 9 V2 FALSE age < 20
## 12 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
## 31 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
## 51 V3 FALSE abs(age_dog_years - age * 7) <= 1e-08
## 71 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
## 91 V3 TRUE abs(age_dog_years - age * 7) <= 1e-08
Then, visually assess based on the “value” column’s TRUE/FALSE entries.
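Rather than scanning by eye, you could also filter down to just the failed checks; a quick sketch:
as.data.frame(checked_lw) %>%
  # keep only the checks that evaluated to FALSE
  dplyr::filter(value == FALSE)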
Option 2: Let validate show us the “violating” rows
validate::violating(little_women,    # original df
                    checked_lw[2:3]) # evaluated object, subset to the rules with violations that can be assessed per row
## character_names age age_dog_years personality_trait
## 5 Amy 12 83 artistic
## 9 Timothee 84 588 <NA>
Rules 2 and 3 were broken. If Rule 1 had been violated, this step wouldn’t have been possible (or necessary), because a variable’s class is something you check once; you don’t assess it on a row-by-row basis.
Check out other types of validations that Meghan Harris covered in her excellent presentation at RStudio::conf(2022):
Alt text: A slide with the title, “Examples of Validations Types I’ve personally used”. There are eight circles with different data validations: Variable type, allowed/expected values, missing data, logic checks, uniqueness, string length, formatting, range checks.
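For a taste of what a few of those other check types might look like as validate rules (a hypothetical sketch; we won’t run these against little_women here):
more_checks <- validator(
  !is.na(personality_trait),                            # missing data
  personality_trait %in% c("impatient", "musical",
                           "artistic", "responsible"),  # allowed/expected values
  nchar(character_names) <= 30,                         # string length
  age >= 0                                              # range check
)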
Let’s correct the values we flagged with validate (and, while we’re at it, fill in Timothee’s missing personality trait), then look at a simple summary of the data.
little_women[5, "personality_trait"] <- "angsty"
little_women[5, "age"] <- 15
little_women[5, "age_dog_years"] <- 15*7
little_women[3, "age_dog_years"] <- 12*7
summary(little_women)
## character_names age age_dog_years personality_trait
## Length:5 Min. :12.0 Min. : 84.0 Length:5
## Class :character 1st Qu.:13.0 1st Qu.: 91.0 Class :character
## Mode :character Median :15.0 Median :105.0 Mode :character
## Mean :14.2 Mean : 99.4
## 3rd Qu.:15.0 3rd Qu.:105.0
## Max. :16.0 Max. :112.0
The character columns’ summaries report each column’s length, class, and mode. For the numeric columns, we get quantitative summaries: minimums and maximums, medians and means, and the 1st and 3rd quartiles. This is another way to find possible errors in the data.
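One small trick for the character columns: summarize them as factors to get a count per unique value, which makes typos and unexpected categories easier to spot. A quick sketch:
summary(as.factor(little_women$personality_trait))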
Another easy way to get a sense of your data is to plot it.
Let’s use a much richer dataset from the gapminder package. I guess it deviates from this tutorial’s literature theme, but hey, some of the great novels were written during the years included in this dataset!
install.package("gapminder")
library(gapminder)
We’ll save a local copy of the dataset, called “gap_data”, then explore it.
gap_data <- gapminder
str(gap_data)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
head(gap_data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
This dataset contains information about the life expectancy, population, and GDP per capita for many countries over many years. How many countries? How many years?
First: how many countries?
length(unique(gap_data$country))
## [1] 142
How many continents?
length(unique(gap_data$continent))
## [1] 5
During which years?
range(gap_data$year)
## [1] 1952 2007
But only 12 years:
length(unique(gap_data$year))
## [1] 12
Specifically, these years. Two per decade, it appears:
unique(gap_data$year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
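We can confirm that spacing directly; the gaps between consecutive survey years should all be 5:
diff(sort(unique(gap_data$year))) # should print eleven 5s, i.e., one survey every five years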
Let’s explore the data visually. We can look for outliers, trends, and more to familiarize ourselves with the data. We’ll start by using the base R scatterplot matrix function, pairs(), to plot all six variables in the dataset against one another.
gap_data %>%
pairs()
We can start to see some trends here; for example, year and life expectancy seem to have a positive relationship. However, there’s a lot of data, and each country has many entries (12 years’ worth each), so let’s filter a little, then replot.
We’ll keep only Bolivia’s rows and exclude the country and continent variables. We have to be careful with those two because they’re categorical, not numeric like the remaining variables.
gap_data %>%
dplyr::filter(country == "Bolivia") %>%
dplyr::select(year, lifeExp, pop, gdpPercap) %>%
pairs()
Very nice. Now we can see some trends more clearly, like that year, life expectancy, and population have pretty clear positive relationships. GDP per capita isn’t as tightly correlated with these factors (the R-squared values are likely lower), but it, too, appears positively related to the others.
If you had entered this data by hand, this plotting exercise could help point out that you’d entered 3000 instead of 2000 for year or had missed a few entries, for instance. It also informs you of the patterns you should expect, developing your instincts to spot errors before you get in too deep.
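You could even encode that expectation as a validate rule, so a typo like 3000 would get flagged automatically; a small sketch reusing the approach from earlier:
year_checks <- validator(year >= 1952, year <= 2007)
summary(validate::confront(gap_data, year_checks))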
These are some of the key ideas I think about while cleaning data. What goes through your mind when you receive a new dataset? Do you have any favorite tips or tricks for familiarizing yourself with or cleaning the data? Feel free to let me know! There’s always something new to learn!