The first example of untidy data concerns a data set that I have created based on my favourite fantasy book series, Harry Potter.
The table below lists five main characters in the first column and some characteristics of their magic wands in the second column: the type of wood the wand is made of, the core, the length (in inches) and the spell that best characterizes the respective wizard or witch. The second column therefore packs several variables into one, which is a typical case of untidy data. Let’s fix it!
| character_name | wand_characteristics |
|---|---|
| Harry Potter | Holly, Phoenix feather, 11 inches, Expelliarmus |
| Hermione Granger | Vine, Dragon heartstring, 10.75 inches, Alohomora |
| Ron Weasley | Willow, Unicorn hair, 14 inches, Lumos |
| Albus Dumbledore | Elder, Thestral tail hair, 15 inches, Protego Maxima |
| Lord Voldemort (Tom Riddle) | Yew, Phoenix feather, 13.5 inches, Avada Kedavra |
First, I load the necessary R packages.
```r
library(tidyr)
library(readr)
library(dplyr)
```
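The code further down works on a data frame called `HP_untidy`, which is not constructed in the text. If you want to follow along, here is a minimal sketch of how it could be built from the table above. It assumes a tibble with the two column names from the table header; the use of `tibble::tibble()` is my choice for illustration and not necessarily how the original object was created.

```r
# Hypothetical reconstruction of HP_untidy from the table above;
# the post itself does not show how this object was created.
HP_untidy <- tibble::tibble(
  character_name = c(
    "Harry Potter",
    "Hermione Granger",
    "Ron Weasley",
    "Albus Dumbledore",
    "Lord Voldemort (Tom Riddle)"
  ),
  # All four wand properties are packed into a single string per character.
  wand_characteristics = c(
    "Holly, Phoenix feather, 11 inches, Expelliarmus",
    "Vine, Dragon heartstring, 10.75 inches, Alohomora",
    "Willow, Unicorn hair, 14 inches, Lumos",
    "Elder, Thestral tail hair, 15 inches, Protego Maxima",
    "Yew, Phoenix feather, 13.5 inches, Avada Kedavra"
  )
)
```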
Thanks to these super useful R packages, we will be able to fix this
untidy data set with just a wave of a wand! In the code chunk below I
state explicitly which R package each R function comes from, using the
form `package::function`, so that there is no doubt about which
package a function belongs to.
First, I split the information in the `wand_characteristics` column
into four columns, `wood`, `core`, `length_in_inches` and
`signature_spell`, using the `tidyr::separate()` function. The values
are separated by a comma followed by a space, so it is necessary to
specify the `sep = ", "` argument (note the space after the comma).
Second, I use the `dplyr::mutate()` and `readr::parse_number()`
functions to convert the `length_in_inches` column from character
strings to numbers.
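```r
# Split the combined column into four columns on ", ".
HP_tidy <- tidyr::separate(HP_untidy, wand_characteristics,
                           into = c("wood", "core", "length_in_inches", "signature_spell"),
                           sep = ", ")
# Convert strings such as "10.75 inches" to numbers.
HP_tidy <- HP_tidy %>%
  dplyr::mutate(length_in_inches = readr::parse_number(length_in_inches))
```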
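As a side note, recent versions of tidyr (1.3.0 and later) mark `separate()` as superseded in favour of `separate_wider_delim()`. The sketch below shows the same two steps with that function on a one-row stand-in table. The object names `wand_demo` and `wand_demo_tidy` are hypothetical and only serve this illustration; the approach used in this post remains `tidyr::separate()`.

```r
library(tidyr)
library(readr)
library(dplyr)

# One-row stand-in for the untidy table (hypothetical object name).
wand_demo <- tibble::tibble(
  character_name = "Hermione Granger",
  wand_characteristics = "Vine, Dragon heartstring, 10.75 inches, Alohomora"
)

wand_demo_tidy <- wand_demo %>%
  # Split on ", " into four named columns (assumes tidyr >= 1.3.0).
  tidyr::separate_wider_delim(
    wand_characteristics,
    delim = ", ",
    names = c("wood", "core", "length_in_inches", "signature_spell")
  ) %>%
  # Turn the string "10.75 inches" into the number 10.75.
  dplyr::mutate(length_in_inches = readr::parse_number(length_in_inches))
```

Either way, the combined column is replaced by the four new columns, and the length ends up as a number rather than text.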
The resulting table is shown below. Easy peasy!
| character_name | wood | core | length_in_inches | signature_spell |
|---|---|---|---|---|
| Harry Potter | Holly | Phoenix feather | 11.00 | Expelliarmus |
| Hermione Granger | Vine | Dragon heartstring | 10.75 | Alohomora |
| Ron Weasley | Willow | Unicorn hair | 14.00 | Lumos |
| Albus Dumbledore | Elder | Thestral tail hair | 15.00 | Protego Maxima |
| Lord Voldemort (Tom Riddle) | Yew | Phoenix feather | 13.50 | Avada Kedavra |
Now that the data is tidy, we can start to analyze it. Look, for
example, at the diagram below.
This example shows in a simple fashion how important tidy data is for answering the questions you are interested in. For example, which of the five main characters in the data set above has the longest magic wand? It happens to be Albus Dumbledore. 10 points for Gryffindor!
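The longest-wand question can also be answered in code once the length is numeric. Here is a minimal sketch, assuming a small table `wand_lengths` (a hypothetical name) with the names and lengths copied from the tidy table above; with the full tidy data you would pass `HP_tidy` instead.

```r
library(dplyr)

# Names and lengths copied from the tidy table above (hypothetical object name).
wand_lengths <- tibble::tibble(
  character_name = c(
    "Harry Potter",
    "Hermione Granger",
    "Ron Weasley",
    "Albus Dumbledore",
    "Lord Voldemort (Tom Riddle)"
  ),
  length_in_inches = c(11, 10.75, 14, 15, 13.5)
)

# Keep only the row with the largest value of length_in_inches.
longest_wand <- dplyr::slice_max(wand_lengths, length_in_inches, n = 1)
longest_wand$character_name
```

This only works because `length_in_inches` is numeric; on the original untidy column the lengths were buried inside a text string and could not be compared.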