May 29, 2018

Intro

About me

Niels Ole Dam

Physics and Communications at Roskilde University

Previously:

  • PM at Experimentarium
  • Journalist at Illustrated Science
  • Digital Concept Developer at Bonnier Publications
  • PM and TM at Peytz & Co
  • Games Producer at Cape Cph

Since 2014:

  • Independent Consultant at Things in Flow

Specialties:

Agile Workflows, Personal Productivity, Data Wrangling, Meeting Facilitation, Process Analysis and Visualization, Process Mining, Data Analysis

About R

Why did R need a Tidyverse?

The basics

  • R is Open Source
  • R is an old language – first appeared in 1993
  • R is based on S, which is even older – first built in 1976
  • Easy to build and expand upon with packages
  • Now more than 12,000 packages

Consequences

  • Open Source + Old = Lots of contributors over time
  • Lots of contributors + Old = Lots of old code and lots of different coding paradigms
  • Ex: 4 (!) different OO systems (Base types, S3, S4 and Reference Classes) – see the sketch below
  • = Rather difficult to (really) learn from scratch
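
As a quick illustration (not from the slides), the same tiny "person" idea written in three of the four object systems:

# S3: just a list with a class attribute plus a generic/method pair
p1 <- structure(list(name = "Ada"), class = "person")
print.person <- function(x, ...) cat("Person:", x$name, "\n")
print(p1)

# S4: formal class definition with typed slots
setClass("Person", representation(name = "character"))
p2 <- new("Person", name = "Ada")

# Reference Classes (R5): mutable objects with methods attached
PersonRC <- setRefClass("PersonRC", fields = list(name = "character"))
p3 <- PersonRC$new(name = "Ada")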

Example of old base R ugliness

# The problem: base R silently partial-matches and converts strings to factors
df <- data.frame(xyz = "a")
df$x                  # partial matching silently returns the xyz column
class(df$xyz)         # "factor" – the string was converted on creation
as.numeric(df$xyz)    # 1 – the underlying factor code, not NA

# The Tidyverse solution: tibbles never partial-match or create factors
library(dplyr)
df <- data_frame(xyz = "a")
df$x                  # NULL, with a warning about the unknown column
class(df$xyz)         # "character"
as.numeric(df$xyz)    # NA, with a coercion warning

But why is R so popular then?

A Typical Tidyverse Pipeline

Dataset to tidy up

A subset of data from the World Health Organization Global Tuberculosis Report, and accompanying global populations.

library(tidyverse)
who

Tidying with Python (similar to base R)

import pandas as pd

# df: the WHO dataset loaded as a pandas DataFrame
df = pd.melt(df, id_vars=["country", "year"], value_name="cases", var_name="sex_and_age")
# Extract sex, age lower bound and age upper bound
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})")
# Name columns
tmp_df.columns = ["sex", "age_lower", "age_upper"]
# Create `age` column based on `age_lower` and `age_upper`
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]
# Merge
df = pd.concat([df, tmp_df], axis=1)
# Drop unnecessary columns and rows
df = df.drop(["sex_and_age", "age_lower", "age_upper"], axis=1)
df = df.dropna()
df = df.sort_values(by=["country", "year", "sex", "age"])
df.head(10)

Tidying with Dplyr

library(tidyverse)
who_tidy <- who %>%
    # Make column names consistent: "newrel" -> "new_rel"
    setNames(gsub("newrel", "new_rel", names(.))) %>%
    # Collapse the 56 count columns into key/value pairs
    gather("code", "value", 5:60) %>%
    # Split e.g. "new_sp_m014" into "new", "sp" and "m014"
    separate(code, c("new", "var", "sexage")) %>%
    # Split "m014" into sex ("m") and age group ("014")
    separate(sexage, c("sex", "age"), sep = 1) %>%
    # Give each measured variable its own column again
    spread(var, value) %>%
    # Drop the redundant country-code columns
    select(-iso2, -iso3)
who

Tidying with Dplyr (contd.)

who_tidy

The force driving Tidyverse

Enter Hadley Wickham…

Enter Hadley Wickham

  • A man with a cause!
  • “You should behave responsibly R-wise!”
  • “You should embrace modern engineering practices!”
  • “You should use R as it was intended!”
  • “You should respect your peers – and all other human beings!”
  • [ The above = my take on what drives the guy… ]

Hadley's own take on the “Tidy Tools Manifesto”

Ok – but what is this Tidyverse?

Tidyverse – random facts

  • Focuses on the way data scientists think and work
  • Started out as “Hadleyverse”
  • Officially changed name to Tidyverse, Oct. 2016
  • Great for beginners and experts alike
  • A collection of consistent R packages supporting a systematic Data Science workflow.
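
A minimal sketch of what that means in practice – installing the meta-package pulls in the whole collection, and attaching it loads the core packages in one go:

# install.packages("tidyverse")   # one-time install from CRAN
library(tidyverse)
# Attaches the core packages (as of tidyverse 1.2): ggplot2, tibble, tidyr,
# readr, purrr, dplyr, stringr and forcats
tidyverse_packages()              # lists every package in the Tidyverse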

The Tidyverse Workflow

Tidyverse – ordered by function

From [speakerdeck.com/hadley/tidyverse]

Tidyverse – by usage

Tidyverse – Beginners vs. advanced

The Tidyverse revolution in numbers

Top-100 downloads last month

library(dplyr); library(tibble); library(rvest); library(jsonlite)

url <- "https://www.tidyverse.org/packages/"
tidy_packages <- url %>%
    read_html() %>% 
    html_nodes("a") %>% 
    html_text() %>% 
    unique()
tidy_packages <- c(tidy_packages,
                   "tidyverse", "dbplyr", "tidyselect", "plyr", "lazyeval")
cran_top_100 <- fromJSON("https://cranlogs.r-pkg.org/top/last-month/100")$downloads
cran_top_100$Tidyverse <- cran_top_100$package %in% tidy_packages
cran_top_100 <- cran_top_100 %>% 
    rownames_to_column() %>% 
    rename(rank = rowname)

Top-100 downloads last month (contd.)

Tidyverse CRAN rankings last month

cran_top_100 %>% 
    filter(Tidyverse == TRUE) %>% 
    select(-Tidyverse)
 rank    package downloads
    4    stringr    503705
    7    ggplot2    439296
    8     tibble    425023
   11      dplyr    398653
   12       glue    390245
   18   magrittr    347332
   23       plyr    336000
   25   jsonlite    308499
   26   lazyeval    308343
   37      purrr    260274
   39     readxl    254079
   40      tidyr    253041
   44  lubridate    246618
   45 tidyselect    243500
   47      readr    238927
   48        hms    238774
   54       httr    224839
   60        DBI    202635
   66      haven    187368
   69    forcats    181827
   77       xml2    156151
   78      broom    155492

Tidyverse percentage of top-100 downloads

library(scales)
tidyverse_pct <- cran_top_100 %>% 
    group_by(Tidyverse) %>% 
    summarise(total = sum(as.numeric(downloads))) %>% 
    mutate(total = percent(total / sum(total)))
 Tidyverse total
     FALSE 75.4%
      TRUE 24.6%

Tidyverse percentage over time

Tidyverse percentage over time (contd.)

Enough! Just show me the code!

CRAN – The Main Source of R Packages

The anatomy of an R package

My top 10 Tidyverse tricks

Use readr for file input – and exploration

[ Demo ]

Tip: readr is integrated into RStudio's file navigation
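
A minimal sketch of the kind of thing the demo covers; "flights.csv" and the dep_delay column are placeholders, not files from the talk:

library(readr)

# read_csv() is fast, never converts strings to factors,
# and reports the column types it guessed
flights <- read_csv("flights.csv")
spec(flights)                     # show the full column specification

# Override a guess explicitly instead of relying on the defaults
flights <- read_csv("flights.csv",
                    col_types = cols(dep_delay = col_integer()))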

Use lists whenever possible

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

[ Demo ]

…and explore the lists interactively

View(outputTables[["INITIALIZE"]])

View(outputTables[["STOP"]])
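
A small sketch of other ways to poke around in nested lists, using the x1–x3 lists from the previous slide:

str(x2)                  # print the structure in the console
x2[[1]][[2]]             # second element of the first sub-list
purrr::pluck(x2, 1, 2)   # the same, but pipe-friendly
View(x3)                 # browse interactively in RStudio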

Use httr for APIs

library(httr)
library(jsonlite)

spotifyOAuth <- function(app_id, client_id, client_secret) {
    spotifyR <- httr::oauth_endpoint(
        authorize = "https://accounts.spotify.com/authorize",
        access = "https://accounts.spotify.com/api/token")
    myapp <- httr::oauth_app(app_id, client_id, client_secret)
    return(httr::oauth2.0_token(spotifyR, myapp, scope = "playlist-modify-public"))
}

keys <- spotifyOAuth("roskilde-2017",
                     "your_client_id",
                     "your_client_secret")

…and get the data

searchArtist <- function(artistName) {
    r <- httr::RETRY("GET", paste0("https://api.spotify.com/v1/search?q=", 
                                   gsub(' ', '+', artistName),"&type=artist&market=DK"),
                     times = 30)
    req <- jsonlite::fromJSON(content(r, "text"))
    if (!is.null(req$artists$total) && req$artists$total > 0) {
        artist <- req$artists$items[,c("id", "name", "popularity", "genres", "type")]
        artist$followers <- as.numeric(req$artists$items$followers$total)
        return(artist)
    } else {
        return(NA)
    }
}

sp_artists_raw <- lapply(rf_artists$encodedName, searchArtist)

Use tibble for more readable code

# The problem
df <- data.frame(artists = c(artists, artists, artists, artists),
                 show_main_period = c(FALSE, TRUE, FALSE, TRUE),
                 schedule_name = c("scheduleUpcoming", "scheduleMain", "scheduleUpcomingWithURL", 
                                   "scheduleMainWithURL"))

# The Tidyverse solution
df <- tribble(
    ~artists, ~show_main_period, ~add_url, ~YEAR,  ~sp_data,   ~path,      ~schedule_name,
    artists,  FALSE,             FALSE,    "2017", sp_artists, "data_out", "scheduleUpcoming",
    artists,  TRUE,              FALSE,    "2017", sp_artists, "data_out", "scheduleMain",
    artists,  FALSE,             TRUE,     "2017", sp_artists, "data_out", "scheduleUpcomingWithURL",
    artists,  TRUE,              TRUE,     "2017", sp_artists, "data_out", "scheduleMainWithURL")

[ Demo - Roskilde Festival ]

Use map, map2 and pmap from purrr instead of apply, sapply, mapply etc.

# mu and sigma are not defined on the slide; example values as used in R for Data Science
mu <- list(5, 10, -3)
sigma <- list(1, 5, 10)
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>%
    pmap(rnorm) %>%
    str()

map      returns a list
map_lgl  returns a logical vector
map_chr  returns a character vector
map_int  returns an integer vector
map_dbl  returns a double vector
map_df   row-binds the results into a data frame
walk     calls the function for its side effects and returns the input invisibly
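
A few one-liners (not from the slides) showing the difference between the variants – the typed ones fail loudly if a result cannot be converted to the promised type:

library(purrr)
library(tibble)

map(1:3, ~ .x * 2)                          # list of length 3
map_dbl(1:3, ~ .x * 2)                      # double vector: 2 4 6
map_chr(c("a", "b", "c"), toupper)          # character vector: "A" "B" "C"
map_df(1:3, ~ tibble(x = .x, sq = .x^2))    # 3-row data frame
walk(1:3, print)                            # prints 1, 2, 3 and returns the input invisibly
# map_int(1:3, ~ .x / 2)                    # would error: results are not integers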

Use dplyr and tidyr for your data pipeline

library(gapminder)
library(tidyr)
library(dplyr)

View(gapminder)

by_country <- gapminder %>%
    mutate(year1950 = year - 1950) %>%
    group_by(continent, country) %>%
    nest()

View(by_country)

[ Gapminder Demo ]

Use purrr::map for your data pipeline

library(purrr)

country_model <- function(df) {
    lm(lifeExp ~ year1950, data = df)
}

by_country2 <- by_country %>%
    mutate(model = map(data, country_model))

View(by_country2)

[ Demo - cont. ]

Use modelr for your data pipeline

library(modelr)

by_country3 <- by_country2 %>% 
    mutate(resids = map2(data, model, add_residuals))

View(by_country3)

[ Demo - cont. ]

Use ggplot2 for your data pipeline

library(ggplot2)

unnest(by_country3, resids) %>%
    ggplot(aes(year, resid, group = country)) +
    geom_line(alpha = 1 / 3) + facet_wrap(~continent)

[ Demo - cont. ]

Use ggplot2 for your data pipeline (Contd.)

Tidyverse piping for the win!

Easy Database access via dplyr

  • Reuses the dplyr pipeline metaphor
  • Lazy evaluation
  • More intuitive way to build a SQL query
  • Makes code easier to read and maintain
  • Great for newbies and experts alike
  • Easy to switch underlying DB system

Dplyr database demo

library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, nycflights13::flights, "flights",
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day"), 
    "carrier", 
    "tailnum",
    "dest"
  )
)
flights_db <- tbl(con, "flights")
tailnum_delay_db <- flights_db %>% 
    group_by(tailnum) %>%
    summarise(
        delay = mean(arr_delay, na.rm = TRUE),
        n = n()
    ) %>% 
    arrange(desc(delay)) %>%
    filter(n > 100)
show_query(tailnum_delay_db)
x <- collect(tailnum_delay_db)

Typical dplyr database connections

Five commonly used backends are listed below; a small connection sketch follows the list:

  • RMySQL connects to MySQL and MariaDB.
  • RPostgreSQL connects to Postgres and Redshift.
  • RSQLite embeds a SQLite database.
  • odbc connects to many commercial databases via the open database connectivity protocol.
  • bigrquery connects to Google’s BigQuery.
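
A sketch of what "easy to switch" looks like in practice – only the connection changes, the dplyr pipeline stays the same. The host, database, user and password below are placeholders, not values from the talk:

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                      host = "localhost", dbname = "flightsdb",
                      user = "analyst", password = "secret")

flights_db <- tbl(con, "flights")
flights_db %>%
    group_by(carrier) %>%
    summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
    arrange(desc(mean_delay))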

Dplyr + Sparklyr for big data

  • Wrapper around Apache Spark
  • Easy access to Spark directly from R and RStudio
  • Local Apache Spark built in if needed
  • = Great for newbies and experts alike

Dplyr + Sparklyr for big data (contd.)

library(sparklyr)
library(dplyr)
library(nycflights13)
library(ggplot2)

sc <- spark_connect(master="local")
flights <- copy_to(sc, flights, "flights")
airlines <- copy_to(sc, airlines, "airlines")
src_tbls(sc)
data <- flights %>%
    filter(day == 17, month == 5, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
    select(year, month, day, carrier, dep_delay, air_time, distance) %>%
    arrange(year, month, day, carrier) %>%
    mutate(air_time_hours = air_time / 60) %>%
    group_by(carrier) %>%
    summarize(count = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE))
show_query(data)
carrierhours <- collect(data)

[ Demo ]

Things to remember

The Tidyverse needs Tidy data

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
  • [ From Tidy Data by Hadley Wickham ]
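
A tiny, made-up example of the difference – the same counts in untidy and tidy form:

library(tibble)
library(tidyr)

# Untidy: the year variable is hidden in the column names
untidy <- tribble(
    ~country, ~`1999`, ~`2000`,
    "DK",        2000,    2500,
    "SE",        3000,    3200
)

# Tidy: one variable per column, one observation per row
tidy <- untidy %>%
    gather("year", "cases", `1999`, `2000`)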

Know where you are in the pipeline

I want all this Tidy Goodness!!! Where do I start?

Hadley Wickham's talk at Plotcon 2016, New York (30 min)

R for Data Science (book) – a great introduction to Data Science and the Tidyverse.

RStudio Webinars (videos)

Advanced R (book) – a great introduction to R as a full-fledged functional programming language.

R Cheatsheets by RStudio (pdf) – great for quick reference to the main packages.

The 5 Steps to Tidyverse

Still not convinced?!

Don’t take my word for it –
other people are fans too!

“The Tidyverse is where we want to be!”

Andrew Gelman
Prof. Columbia University
#1 STAN Core Developer
(among other things)

The Future: A Platform-Independent Tidyverse

…important because…

…important because… (contd.)

So whether you like it or not…

Tidyverse is coming to a galaxy near you!

(and you really should embrace it)

:-D

Thanks!