tidyr Introduction

Matt Pettis
2017-11-30

Rule 34 of Data Analysis

  • If there is a perverse way of storing data, someone has done it, and you will have to deal with it one day.

Rule 34 of Data Analysis

  • If there is a perverse way of storing data, someone has done it, and you will have to deal with it one day.

alt text

  • Me, when I look at someone's custom built Excel file wheren they have “just learned how awesome VBScript is!”

Two ways of storing tabular data

  • Left is wide format, right is long format.
  • Columns X1 -- X3 in wide format become values under key in long format.
  • Values under columns X1 -- X3 in wide format become the values under the value column in long format.

Why do we care about different formats?

  • Some packages want data in wide formats, while others want it in long format.
  • Data can be stored in a form that locks up important information in the column name, and we want that accessible as data in a column cell.
  • We want a display of data that is easier to look at but not easier to process.

Doing it the hard way

Doing it the hard way

df_long <- bind_rows(
  data.frame(DateTime = df_wide$DateTime, key="X1", value=df_wide$X1, stringsAsFactors = FALSE)
  , data.frame(DateTime = df_wide$DateTime, key="X2", value=df_wide$X2, stringsAsFactors = FALSE)
  , data.frame(DateTime = df_wide$DateTime, key="X3", value=df_wide$X3, stringsAsFactors = FALSE))

Doing it the hard way

df_wide2 <- merge(
  setNames(df_long[df_long$key == "X1",c("DateTime", "value")], c("DateTime", "X1"))
  , setNames(df_long[df_long$key == "X2",c("DateTime", "value")], c("DateTime", "X2"))
  , by = "DateTime")

df_wide2 <- merge(
  df_wide2
  , setNames(df_long[df_long$key == "X3",c("DateTime", "value")], c("DateTime", "X3"))
  , by = "DateTime")

Doing it the hard way

  • I make it look like this:
  • … but other, native solutions aren't really much better.

Doing it the easy, tidyr way

df_wide_tidy <- spread(df_long, key, value)

Doing it the easy, tidyr way

df_long_tidy <- gather(df_wide, key, value, c("X1", "X2", "X3"))

#df_long_tidy <- gather(df_wide, key, value, -DateTime) # Alternate
#df_long_tidy <- gather(df_wide, key, value, X1:X3)     # Alternate

Other tidyr functionality

  • Split a column of cells, that has delimeters, into multiple columns.
  • … or multiple rows.
  • Add rows that represent missing combinations of observations and features.

Reference Links

Questions, and Thank You

Matt Pettis

email: matthew.pettis@gmail.com

github: mpettis