What is Tidy Data?

Example 1: multiple variables in a single column

The first example of untidy data is a data set that I created based on my favourite fantasy book series, Harry Potter.

In the table below, the first column lists five main characters and the second column lists some characteristics of their magic wands: the type of wood the wand is made of, the core, the length (in inches) and the spell that best characterizes the respective wizard or witch. The second column therefore holds multiple variables, which is a typical case of untidy data. Let’s fix it!

character_name               wand_characteristics
Harry Potter                 Holly, Phoenix feather, 11 inches, Expelliarmus
Hermione Granger             Vine, Dragon heartstring, 10.75 inches, Alohomora
Ron Weasley                  Willow, Unicorn hair, 14 inches, Lumos
Albus Dumbledore             Elder, Thestral tail hair, 15 inches, Protego Maxima
Lord Voldemort (Tom Riddle)  Yew, Phoenix feather, 13.5 inches, Avada Kedavra
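The post does not show how this table was created in R. As a minimal sketch, assuming the object is called HP_untidy (the name used in the code further down) and that a plain base R data frame is good enough, it could be built like this:

```r
# Sketch: recreate the untidy table as a base R data frame.
# The column names and values are copied from the table above;
# using data.frame() is an assumption, not the author's stated method.
HP_untidy <- data.frame(
  character_name = c(
    "Harry Potter",
    "Hermione Granger",
    "Ron Weasley",
    "Albus Dumbledore",
    "Lord Voldemort (Tom Riddle)"
  ),
  wand_characteristics = c(
    "Holly, Phoenix feather, 11 inches, Expelliarmus",
    "Vine, Dragon heartstring, 10.75 inches, Alohomora",
    "Willow, Unicorn hair, 14 inches, Lumos",
    "Elder, Thestral tail hair, 15 inches, Protego Maxima",
    "Yew, Phoenix feather, 13.5 inches, Avada Kedavra"
  ),
  stringsAsFactors = FALSE
)

# Each row stores four variables inside a single string.
str(HP_untidy)
```

No packages are needed for this step, so it works before any library() call.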

First, I load the necessary R packages.

library(tidyr)
library(readr)
library(dplyr)

Thanks to these R packages, we can fix this untidy data set with just a wave of a wand! In the code chunk below, I prefix every function with its package in the form package::function(), so that it is clear which package each function comes from.

First, I split the wand_characteristics column into the four columns wood, core, length_in_inches and signature_spell using tidyr::separate(). The values are separated by a comma followed by a space, so the sep = ", " argument is needed (note the space after the comma). Second, I use dplyr::mutate() and readr::parse_number() to convert the length_in_inches column from text to a numeric column.

# Split wand_characteristics into four columns at every ", "
HP_tidy <- tidyr::separate(HP_untidy, wand_characteristics,
                           into = c("wood", "core", "length_in_inches", "signature_spell"),
                           sep = ", ")
# Turn text such as "11 inches" into the number 11
HP_tidy <- HP_tidy %>% dplyr::mutate(length_in_inches = readr::parse_number(length_in_inches))
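To see what each of the two steps does on its own, here is a small sketch on a single row taken from the table above. The object names one_wand and split_wand are made up for this illustration.

```r
library(tidyr)
library(readr)

# One row from the untidy table, kept small for illustration.
one_wand <- data.frame(
  character_name = "Hermione Granger",
  wand_characteristics = "Vine, Dragon heartstring, 10.75 inches, Alohomora",
  stringsAsFactors = FALSE
)

# Splitting on ", " keeps multi-word values such as
# "Dragon heartstring" together in one column.
split_wand <- tidyr::separate(
  one_wand, wand_characteristics,
  into = c("wood", "core", "length_in_inches", "signature_spell"),
  sep = ", "
)

# After the split, the length is still stored as text.
class(split_wand$length_in_inches)

# parse_number() drops the surrounding text and returns a number.
readr::parse_number(split_wand$length_in_inches)
```

Recent tidyr releases mark separate() as superseded in favour of separate_wider_delim(). It still works as shown here, so whether to switch is a matter of taste for a small example like this one.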

The resulting table is shown below. Easy peasy!

character_name               wood    core                length_in_inches  signature_spell
Harry Potter                 Holly   Phoenix feather     11.00             Expelliarmus
Hermione Granger             Vine    Dragon heartstring  10.75             Alohomora
Ron Weasley                  Willow  Unicorn hair        14.00             Lumos
Albus Dumbledore             Elder   Thestral tail hair  15.00             Protego Maxima
Lord Voldemort (Tom Riddle)  Yew     Phoenix feather     13.50             Avada Kedavra

Now that the data is tidy, we can start to analyze it. Look, for example, at the diagram below.

This example shows in a simple way how important tidy data is for answering the questions you are interested in. For example, which of the five main characters in the data set above has the longest magic wand? It happens to be Albus Dumbledore. 10 points for Gryffindor!
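Once the wand length is numeric, that question can be answered in one line. A minimal sketch, assuming dplyr 1.0.0 or later (for slice_max()) and rebuilding only the two relevant columns of the tidy table so that the snippet runs on its own:

```r
library(dplyr)

# The two tidy columns needed for the question, copied from the table above.
HP_tidy <- data.frame(
  character_name = c(
    "Harry Potter",
    "Hermione Granger",
    "Ron Weasley",
    "Albus Dumbledore",
    "Lord Voldemort (Tom Riddle)"
  ),
  length_in_inches = c(11, 10.75, 14, 15, 13.5),
  stringsAsFactors = FALSE
)

# Keep the row with the largest wand length.
longest <- dplyr::slice_max(HP_tidy, length_in_inches, n = 1)
longest$character_name
```

With the untidy version, the same question would first require pulling the number out of a longer string for every row, which is exactly the work the tidying step did once up front.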