Tidy Data Principles

Once you’ve imported and understand the structure of your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the data set with the way it is stored.

There are three interrelated rules which make a dataset tidy (Hadley Wickham and Grolemund (2016)). In tidy data:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Untidy Data Example

Here, variables do not all have their own columns. First variable is “Student Name”, the second is “Subject” that represents whether the subject is Maths or English, and the third one is the “Grade” information inside the data matrix.

Tidy Data Example

You can see that in this format, each variable forms a column and each student forms a row.

Common Problems

Column headers as values, not variable names

A common problem is a dataset where some (or all) of the column names are not names of variables, but values of a variable. Here is an illustration of this problem:

Multiple variables are stored in rows

The opposite of the first problem can also occur when the variables are stored in rows. In such cases, cells include the actual variables, not the observations. Here is an example:

Multiple variables are stored in one column

Sometimes, one column stores the information of two or more variables. Therefore, multiple variables can be extracted from one column. Here is an illustration of this problem:

Mutliple columns forms a variable

You may need to combine multiple columns into a single column to form a new variable. Here is an illustration of this problem:

The tidyr Package

Make sure the package is loaded and installed.

#install.packages("tidyr")
library(tidyr)

Long vs Wide Data

One of the ways is called “long format”. In this layout, the data set is arranged in such a way that a single subject’s information is stored in multiple rows.

In the wide format, a single subject’s information is stored in multiple columns.

Convert Wide to Long

When column names are values instead of variables, we need to gather or in other words, we need to transform data from wide to long format.

Here is the theory.

# gather(data, name of new “key column”, name of new “value column”, columns to collapse) 

Here is an example.

# gather(data, Year, n, 2:4)

Alternate ways to collapse columns

  • column_name_start:column_name_end (Qtr.1:Qtr.4)
  • -columns_we_dont_want_to_collapse (minus columns)
  • column_#start:column#_end (3:6)
  • list_all_columns (Qtr.1, Qtr.2, Qtr.3, Qtr.4)

Convert Long to Wide

When multiple variables are stored in rows, the spread() function generates columns from rows. In other words, it transforms data from long to wide format. The spread() function is the opposite of gather() function.

Here is the theory.

# spread(data, name of column to spread, column to use as values in those spread columns) 

Here is an example.

# spread(data, Year, n)

Here is another example.

Split/Combine Variables

Split Variables

Here is the theory.

# separate(data, column to separate, column names to put into, separator)

Here is an example.

# separate(data, date, c(“year”, “month”, “day”), sep=”-”)

Here is another example.

# separate(table3, rate, c(“cases”, “population”), sep=”/”)

Combine Variables

unite() is the inverse of separate() function. One can use it to combine multiple columns into a single column.

Here is the theory.

# unite(data, new column to form, columns to unite, separator)

Here is an example.

# unite(data, date, year, month, day, sep=”-”)

Here is another example.

# unite(table5, new_year, century, year, sep=””)