tidyr Introduction

Matt Pettis
2017-11-30

matthew.pettis@gmail.com

Rule 34 of Data Analysis

If there is a perverse way of storing data, someone has done it, and you will have to deal with it one day.

Rule 34 of Data Analysis

If there is a perverse way of storing data, someone has done it, and you will have to deal with it one day.

alt text

Me, when I look at someone's custom built Excel file wheren they have “just learned how awesome VBScript is!”

Two ways of storing tabular data

Left is wide format, right is long format.
Columns X1 -- X3 in wide format become values under key in long format.
Values under columns X1 -- X3 in wide format become the values under the value column in long format.

Why do we care about different formats?

Some packages want data in wide formats, while others want it in long format.
Data can be stored in a form that locks up important information in the column name, and we want that accessible as data in a column cell.
We want a display of data that is easier to look at but not easier to process.

Doing it the hard way

Doing it the hard way

df_long <- bind_rows(
  data.frame(DateTime = df_wide$DateTime, key="X1", value=df_wide$X1, stringsAsFactors = FALSE)
  , data.frame(DateTime = df_wide$DateTime, key="X2", value=df_wide$X2, stringsAsFactors = FALSE)
  , data.frame(DateTime = df_wide$DateTime, key="X3", value=df_wide$X3, stringsAsFactors = FALSE))

Doing it the hard way

df_wide2 <- merge(
  setNames(df_long[df_long$key == "X1",c("DateTime", "value")], c("DateTime", "X1"))
  , setNames(df_long[df_long$key == "X2",c("DateTime", "value")], c("DateTime", "X2"))
  , by = "DateTime")

df_wide2 <- merge(
  df_wide2
  , setNames(df_long[df_long$key == "X3",c("DateTime", "value")], c("DateTime", "X3"))
  , by = "DateTime")

Doing it the hard way

I make it look like this:

… but other, native solutions aren't really much better.

Doing it the easy, tidyr way

df_wide_tidy <- spread(df_long, key, value)

Doing it the easy, tidyr way

df_long_tidy <- gather(df_wide, key, value, c("X1", "X2", "X3"))

#df_long_tidy <- gather(df_wide, key, value, -DateTime) # Alternate
#df_long_tidy <- gather(df_wide, key, value, X1:X3)     # Alternate

Other tidyr functionality

Split a column of cells, that has delimeters, into multiple columns.
… or multiple rows.
Add rows that represent missing combinations of observations and features.

Reference Links

Tidy Data, the Hadley Wickham paper that lays out his philosophy about these things, data formats, and his first two packages that did this work (reshape and reshape2) : http://vita.had.co.nz/papers/tidy-data.pdf
Main documentation page for the tidyr package: http://tidyr.tidyverse.org/
The RMarkdown file with code: https://gist.github.com/mpettis/595e2f5f0fd4b7f37bae29f09c9c1280
38 animated gifs of people dramatically failing for the purpose of infomercials: https://www.buzzfeed.com/julianbrand/40-gifs-of-stupid-infomercial-people-6eof?utm_term=.pkxql4k0n

Questions, and Thank You

Matt Pettis

email: matthew.pettis@gmail.com

github: mpettis