tidyverse

September 27, 2017

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science.

All packages share an underlying philosophy ('tidy') and common APIs.

https://www.tidyverse.org/

Tidyverse Packages

tidy

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

http://r4ds.had.co.nz/tidy-data.html

tidying data: example 1

For example, which of these two datasets is tidy?

Tuberculosis cases for Afghanistan, Brazil, and China: 1999-2000

## # A tibble: 6 x 4
##       country  year  cases population
##         <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

tidying data: example 2

## # A tibble: 12 x 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

tidying functions: spread

spread() replaces reshape() in base R and the pivot table feature in Excel: it turns the values of a key column into new column names.

http://tidyr.tidyverse.org/

tidying with `spread`

spread(table2, key = type, value = count)

## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583
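Since tidyr 1.0.0, spread() has been superseded by pivot_wider(); spread() still works, but new code usually uses the newer function. The sketch below rebuilds the first four rows of the long table by hand (the name `long` is made up for this example) and widens it.

```r
library(tidyr)   # pivot_wider() needs tidyr >= 1.0.0
library(tibble)

# Hand-built stand-in for the long table shown earlier (first four rows only)
long <- tribble(
  ~country,      ~year, ~type,        ~count,
  "Afghanistan", 1999,  "cases",           745,
  "Afghanistan", 1999,  "population", 19987071,
  "Afghanistan", 2000,  "cases",          2666,
  "Afghanistan", 2000,  "population", 20595360
)

# names_from plays the role of spread()'s `key`,
# values_from plays the role of `value`
wide <- pivot_wider(long, names_from = type, values_from = count)
wide
```

Each country-year pair now occupies one row, with cases and population as separate columns.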

Why tidyverse?

Two Schools of Thought for Teaching R

Base R first:

Teach syntax such as $ and [[]], loops and conditionals, data types, etc.
Possibly introduce dplyr or data.table as alternatives.

Tidyverse first:

Keep a single-minded focus on data frames and dplyr; introduce ggplot2, tidyr and purrr shortly afterwards.
Introduce the %>% operator; skip syntax like [[]] and $ for later.

David Robinson: http://varianceexplained.org/r/teach-tidyverse/

Teach the Tidyverse to Beginners

Get students doing powerful things quickly.
Don’t teach procedural logic (loops and conditionals).
Teach tables (group by and join).
%>% ("piping") isn’t too hard for beginners.
"There’s only one way to do it."
Conclusion: You have permission not to be boring.

David Robinson: http://varianceexplained.org/r/teach-tidyverse/

dplyr: Six Core Functions

mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
group_by() groups rows by one or more variables so that later verbs, such as summarise(), operate within each group.

http://dplyr.tidyverse.org/
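A minimal sketch of the six verbs working together. The `toy` data frame and its column names are invented for illustration and are not part of the tweets data used in these slides.

```r
library(dplyr)

# Hypothetical toy data: tweet counts per user
toy <- tibble(
  user   = c("a", "b", "c", "d"),
  city   = c("Charlotte", "Charlotte", "Concord", "Concord"),
  tweets = c(10, 30, 5, 15),
  days   = c(5, 10, 5, 5)
)

result <- toy %>%
  mutate(per_day = tweets / days) %>%   # add a derived column
  select(city, tweets, per_day) %>%     # keep three columns by name
  filter(tweets >= 10) %>%              # keep rows with at least 10 tweets
  group_by(city) %>%                    # mark groups; nothing is aggregated yet
  summarise(total = sum(tweets),
            mean_per_day = mean(per_day)) %>%  # collapse to one row per city
  arrange(desc(total))                  # largest total first
result
```

Note that group_by() alone leaves the rows unchanged; the aggregation happens in summarise().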

dplyr

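Load a CSV file with readr, then count rows per group with dplyr.

library(tidyverse) # attaches readr, dplyr, ggplot2, tidyr, purrr, ...
tweets <- read_csv("../data/CharlotteTweets20Sample.csv") # readr function
df <- group_by(tweets, geo.type) # group rows by geo.type
summarise(df, Count = n()) # count the rows in each group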

## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591

dplyr with piping (%>%)

df <- group_by(tweets, geo.type)
summarise(df, Count = n())

## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591

tweets %>%
  group_by(geo.type) %>%
  summarise(Count=n())

## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591
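The two versions give the same result because x %>% f(y) is just another way of writing f(x, y). A small sketch with a made-up vector `x` shows the equivalence on plain numbers:

```r
library(dplyr)  # attaches %>% (re-exported from magrittr)

x <- c(4, 9, 16)

# Nested calls are read inside-out
nested <- round(mean(sqrt(x)), 1)

# The piped version is read left to right, one step per call
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```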

Piping gets really helpful for long commands…

tweets %>% 
  group_by(actor.location.displayName) %>%
  summarise(Count=n()) %>%
  arrange(desc(Count)) %>%
  head(n=5)

## # A tibble: 5 x 2
##   actor.location.displayName Count
##                        <chr> <int>
## 1                       <NA> 10655
## 2              Charlotte, NC  9270
## 3             Kannapolis, NC   962
## 4           Huntersville, NC   823
## 5             North Carolina   657

ggplot2: Visualizations

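# df is assumed to hold the 25 most frequent profile locations and their counts,
# e.g. the pipeline above ending in head(n = 25); the slide does not show that step.
ggplot(df, aes(x = actor.location.displayName, y = Count)) + 
  geom_bar(stat = "identity") + coord_flip() + # horizontal bars, one per location
  labs(x = "Profile Location", y = "Number of Tweets", 
       title = "Top 25 Profile Locations of Geolocated Clt Tweets")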
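A self-contained sketch of the same kind of chart, using the four named locations and counts from the top-5 output shown earlier. The `top_locations` data frame is built by hand here, and reorder() is an addition that sorts the bars by count.

```r
library(ggplot2)

# Counts copied from the top-5 output above (the NA row is left out)
top_locations <- data.frame(
  location = c("Charlotte, NC", "Kannapolis, NC",
               "Huntersville, NC", "North Carolina"),
  Count    = c(9270, 962, 823, 657)
)

p <- ggplot(top_locations, aes(x = reorder(location, Count), y = Count)) +
  geom_col() +     # equivalent to geom_bar(stat = "identity")
  coord_flip() +   # horizontal bars keep long labels readable
  labs(x = "Profile Location", y = "Number of Tweets")
p
```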

purrr: functional programming

Let's start with the starwars dataset.

head(starwars, n=5)

## # A tibble: 5 x 4
##             name height  mass species
##            <chr>  <int> <dbl>   <chr>
## 1 Luke Skywalker    172    77   Human
## 2          C-3PO    167    75   Droid
## 3          R2-D2     96    32   Droid
## 4    Darth Vader    202   136   Human
## 5    Leia Organa    150    49   Human

purrr: functional programming

Regression of mass on height, fit separately for each simplified species group (species2)…

starwars %>%
  mutate(species2 = ifelse(species == "Human", species, "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .)) %>% # linear regression
  map(summary) %>% # keep the summary output
  map_dbl("r.squared")

##      Human  non-Human 
## 0.42465096 0.01932538

http://purrr.tidyverse.org/
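The same split-then-map pattern can be tried on the built-in mtcars data, which avoids having to prepare the starwars subset first. This is only a sketch; the model mpg ~ wt is chosen for illustration.

```r
library(purrr)  # also makes %>% available

r2 <- mtcars %>%
  split(.$cyl) %>%                        # list of 3 data frames, one per cyl value
  map(~ lm(mpg ~ wt, data = .x)) %>%      # list of 3 fitted lm objects
  map(summary) %>%                        # list of 3 summary.lm objects
  map_dbl("r.squared")                    # named numeric vector, one value per group
r2
```

Each map() call returns a list of the same length as its input; map_dbl() extracts one element by name from each item and simplifies the result to a numeric vector.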

broom

lmFit <- starwars %>%
  mutate(species2 = ifelse(species == "Human", species, "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .))

library(broom)
tidy(lmFit[[1]]) # regression for human

##          term    estimate  std.error statistic     p.value
## 1 (Intercept) -116.582591 51.9891504 -2.242441 0.036425452
## 2      height    1.110384  0.2890067  3.842071 0.001017536

tidy(lmFit[[2]]) # regression for non-human

##          term  estimate   std.error   statistic   p.value
## 1 (Intercept) -7.941615 145.7080170 -0.05450363 0.9568529
## 2      height  0.677325   0.8274773  0.81854206 0.4187480

https://cran.r-project.org/web/packages/broom/vignettes/broom.html
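Because tidy() returns a data frame, the coefficient tables of several models can be stacked into one. A sketch on the built-in mtcars data (the model and the `cyl` id column are illustrative choices, not taken from the slides):

```r
library(dplyr)
library(purrr)
library(broom)

coefs <- mtcars %>%
  split(.$cyl) %>%                     # one data frame per cylinder count
  map(~ lm(mpg ~ wt, data = .x)) %>%   # one model per group
  map(tidy) %>%                        # one coefficient table per model
  bind_rows(.id = "cyl")               # stack them; list names go into `cyl`
coefs
```

The result has one row per model term, so it can be filtered, arranged, or plotted like any other data frame.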

rvest for webscraping

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

## [1] 7.8

rvest for webscraping

lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

##  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
##  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
##  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
## [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
## [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

https://github.com/hadley/rvest
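Live pages change their markup over time, so the selectors above may no longer match the current IMDb page. To practise the same calls offline, read_html() also accepts a literal HTML string. The snippet below is hand-written to mimic the structure those selectors expect; it is not the real page.

```r
library(rvest)

# Hand-written HTML that mimics the structure targeted above
page <- read_html('
  <html><body>
    <strong><span>7.8</span></strong>
    <div id="titleCast">
      <div class="itemprop"><span>Will Arnett</span></div>
      <div class="itemprop"><span>Elizabeth Banks</span></div>
    </div>
  </body></html>
')

# A <span> nested inside <strong> holds the rating
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# <span> elements inside .itemprop, within the element with id "titleCast"
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

rating
cast
```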

tidytext for simple text analysis

library(tidytext)
tidy_tweets <- tweets %>% 
  unnest_tokens(word, body) %>% # tokenize body column
  anti_join(stop_words) %>% # remove stop words
  count(word, sort = TRUE) # count by tokens
head(tidy_tweets, n = 5)

## # A tibble: 5 x 2
##        word     n
##       <chr> <int>
## 1 charlotte  5672
## 2        nc  3328
## 3      love  1485
## 4       day  1344
## 5  carolina  1240

https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
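The same pipeline on three made-up tweets shows what each step does. The `toy_tweets` data frame is invented for illustration, and by = "word" is added to anti_join() only to make the join column explicit.

```r
library(dplyr)
library(tidytext)

# Hypothetical tweets, one per row
toy_tweets <- tibble(
  id   = 1:3,
  body = c("I love Charlotte",
           "Charlotte day in Carolina",
           "Love the day in Charlotte")
)

word_counts <- toy_tweets %>%
  unnest_tokens(word, body) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop rows whose word is a stop word
  count(word, sort = TRUE)                # most frequent words first
word_counts
```

unnest_tokens() lowercases by default, which is why "Love" and "love" are counted together.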

Other important packages

haven http://haven.tidyverse.org/
readxl http://readxl.tidyverse.org/
lubridate http://lubridate.tidyverse.org/
stringr http://stringr.tidyverse.org/
forcats http://forcats.tidyverse.org/
packrat https://rstudio.github.io/packrat/

Future: R Markdown for Replication

This document is an R Markdown presentation (ioslides).

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

For examples on using R Markdown, see http://rmarkdown.rstudio.com/gallery.html.
