September 27, 2017

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science.

All packages share an underlying philosophy ('tidy') and common APIs.

https://www.tidyverse.org/

Tidyverse Packages

tidy

tidying data: example 1

For example, which of these two datasets is tidy?

Tuberculosis Cases for Afghanistan, Brazil, and China: 1999-2000

## # A tibble: 6 x 4
##       country  year  cases population
##         <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

tidying data: example 2

## # A tibble: 12 x 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583

tidying functions: spread

tidying with spread

spread(table2, key = type, value = count)
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583
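
The reshaping above can be tried without the original data. This is a minimal sketch, assuming the tibble and tidyr packages are installed; `table2_small` is a hypothetical hand-typed subset of the table shown earlier. `pivot_wider()` is the newer replacement for `spread()` and assumes tidyr 1.0.0 or later, which postdates these slides.

```r
library(tibble)
library(tidyr)

# Hypothetical hand-typed subset of table2 (Afghanistan only)
table2_small <- tribble(
  ~country,      ~year, ~type,        ~count,
  "Afghanistan", 1999,  "cases",      745,
  "Afghanistan", 1999,  "population", 19987071,
  "Afghanistan", 2000,  "cases",      2666,
  "Afghanistan", 2000,  "population", 20595360
)

# spread() as used on the slide: one new column per value of `type`
wide <- spread(table2_small, key = type, value = count)

# The same reshaping with pivot_wider() (assumes tidyr >= 1.0.0)
wide2 <- pivot_wider(table2_small, names_from = type, values_from = count)

# Once the data are tidy, a derived variable is a one-liner:
# cases per 10,000 people
wide$rate <- wide$cases / wide$population * 10000
wide
```

Each row of `wide` is now one country-year observation, which is what makes the rate calculation a simple column operation.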

Why tidyverse?

Two Schools of Thought for Teaching R

Base R first:

  • Teach syntax such as $ and [[]], loops and conditionals, data types, etc.

  • Possibly introduce dplyr or data.table as alternatives.

Tidyverse first:

  • Keep a single-minded focus on data frames and dplyr; introduce ggplot2, tidyr and purrr shortly afterwards.

  • Introduce the %>% operator; skip syntax like [[]] and $ for later.

David Robinson: http://varianceexplained.org/r/teach-tidyverse/

Teach the Tidyverse to Beginners

  • Get students doing powerful things quickly.

  • Don’t teach procedural logic (loops and conditionals)

  • Teach tables (group by and join).

  • %>% ("piping") isn’t too hard for beginners

  • "There’s only one way to do it"

  • Conclusion: You have permission not to be boring

David Robinson: http://varianceexplained.org/r/teach-tidyverse/

dplyr: Six Core Functions

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
  • group_by() groups rows by one or more variables, so that the verbs that follow (such as summarise()) operate within each group.

http://dplyr.tidyverse.org/
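
The six verbs can be seen together on a small example. This is a sketch only, assuming dplyr is installed; the `toy` data frame and its column names are made up for illustration and are not part of the tweets data.

```r
library(dplyr)

# Hypothetical toy data: a few rows per city
toy <- tibble(
  city  = c("Charlotte", "Charlotte", "Kannapolis", "Kannapolis", "Huntersville"),
  likes = c(10, 4, 7, 1, 3),
  words = c(20, 8, 14, 5, 6)
)

result <- toy %>%
  mutate(likes_per_word = likes / words) %>%        # add a derived variable
  select(city, likes, likes_per_word) %>%           # keep only these columns
  filter(likes >= 3) %>%                            # drop rows with fewer than 3 likes
  group_by(city) %>%                                # later verbs now work per city
  summarise(total_likes = sum(likes), n = n()) %>%  # one row per city
  arrange(desc(total_likes))                        # largest total first
result
```

Note that group_by() on its own changes nothing visible; the grouping only takes effect in the summarise() step that follows it.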

dplyr

Load a CSV file (readr), then apply dplyr functions.

library(tidyverse) # attaches readr, dplyr, ggplot2, purrr, and other core packages
tweets <- read_csv("../data/CharlotteTweets20Sample.csv") # readr function
df <- group_by(tweets, geo.type)
summarise(df, Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591

dplyr with piping (%>%)

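# Without piping: store the intermediate result, then summarise it
df <- group_by(tweets, geo.type)
summarise(df, Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591
# With piping: the same steps, with no intermediate object
tweets %>%
  group_by(geo.type) %>%
  summarise(Count = n())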
## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591

Piping gets really helpful for long commands…

tweets %>% 
  group_by(actor.location.displayName) %>%
  summarise(Count=n()) %>%
  arrange(desc(Count)) %>%
  head(n=5)
## # A tibble: 5 x 2
##   actor.location.displayName Count
##                        <chr> <int>
## 1                       <NA> 10655
## 2              Charlotte, NC  9270
## 3             Kannapolis, NC   962
## 4           Huntersville, NC   823
## 5             North Carolina   657
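
To see why piping helps, compare the same five steps written as nested calls. This is a sketch assuming dplyr is installed; it uses the built-in mtcars data rather than the tweets file, so the column `cyl` stands in for the location column.

```r
library(dplyr)

# Nested calls: must be read from the inside out
nested <- head(
  arrange(
    summarise(group_by(mtcars, cyl), Count = n()),
    desc(Count)
  ),
  n = 2
)

# Piped: read top to bottom, one step per line.
# x %>% f(y) is evaluated as f(x, y).
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(n = 2)

# Both forms give the same result
all.equal(nested, piped)
```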

ggplot2: Visualizations

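# Assumes df now holds the 25 most frequent profile locations with their counts
# (for example, the pipeline above with head(n = 25)), not the grouped df from earlier.
ggplot(df, aes(x = actor.location.displayName, y = Count)) + 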
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Profile Location", y = "Number of Tweets", 
       title = "Top 25 Profile Locations of Geolocated Clt Tweets")
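
A self-contained version of this kind of chart, as a sketch assuming ggplot2 is installed. The counts are copied from the top named locations in the table on the earlier slide; the data frame name `top_locations` and the use of reorder() to sort the bars are additions for illustration.

```r
library(ggplot2)

# Counts copied from the earlier slide; the object name is hypothetical
top_locations <- data.frame(
  location = c("Charlotte, NC", "Kannapolis, NC", "Huntersville, NC", "North Carolina"),
  Count    = c(9270, 962, 823, 657)
)

p <- ggplot(top_locations, aes(x = reorder(location, Count), y = Count)) +
  geom_col() +      # equivalent to geom_bar(stat = "identity")
  coord_flip() +    # horizontal bars keep long labels readable
  labs(x = "Profile Location", y = "Number of Tweets")
p
```

Without reorder(), the bars would appear in alphabetical order of location rather than by count.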

purrr: functional programming

Let's start with the starwars dataset.

head(starwars, n=5)
## # A tibble: 5 x 4
##             name height  mass species
##            <chr>  <int> <dbl>   <chr>
## 1 Luke Skywalker    172    77   Human
## 2          C-3PO    167    75   Droid
## 3          R2-D2     96    32   Droid
## 4    Darth Vader    202   136   Human
## 5    Leia Organa    150    49   Human

purrr: functional programming

Regression of mass on height, fitted separately for each simplified species group (species2)…

starwars %>%
  mutate(species2 = ifelse(species == "Human", species, "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .)) %>% # linear regression
  map(summary) %>% # keep the summary output
  map_dbl("r.squared")
##      Human  non-Human 
## 0.42465096 0.01932538

http://purrr.tidyverse.org/
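
The split-then-map pattern above works on any data frame. Here is a sketch on the built-in mtcars data, assuming purrr is installed; the model `mpg ~ wt` is chosen only for illustration. map() always returns a list, while map_dbl() returns a numeric vector and fails if any element is not a single number.

```r
library(purrr)

r2_by_cyl <- mtcars %>%
  split(.$cyl) %>%                      # list of three data frames, one per cyl value
  map(~ lm(mpg ~ wt, data = .x)) %>%    # list of fitted models
  map(summary) %>%                      # list of model summaries
  map_dbl("r.squared")                  # extract one number per summary
r2_by_cyl
```

The result is a named numeric vector, with names taken from the groups created by split().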

broom

lmFit <- starwars %>%
  mutate(species2 = ifelse(species == "Human", species, "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .))
library(broom)
tidy(lmFit[[1]]) # regression for human
##          term    estimate  std.error statistic     p.value
## 1 (Intercept) -116.582591 51.9891504 -2.242441 0.036425452
## 2      height    1.110384  0.2890067  3.842071 0.001017536
tidy(lmFit[[2]]) # regression for non-human
##          term  estimate   std.error   statistic   p.value
## 1 (Intercept) -7.941615 145.7080170 -0.05450363 0.9568529
## 2      height  0.677325   0.8274773  0.81854206 0.4187480

https://cran.r-project.org/web/packages/broom/vignettes/broom.html
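
Because tidy() returns a data frame, the coefficients from several models can be stacked into one table instead of being inspected one at a time. This is a sketch on the built-in mtcars data, assuming purrr, broom, and dplyr are installed; the model formula is illustrative only.

```r
library(dplyr)
library(purrr)
library(broom)

# One model per group, as on the slide
models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x))

# tidy() each model, then stack the rows; .id records which group each row came from
coefs <- models %>%
  map(tidy) %>%
  bind_rows(.id = "cyl")
coefs
```

The combined table has one row per model term, so it can go straight into filter(), arrange(), or ggplot().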

rvest for webscraping

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
## [1] 7.8

rvest for webscraping

lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
##  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
##  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
##  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
## [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
## [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

https://github.com/hadley/rvest
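
Live pages change their markup over time, so the selectors above may no longer match. The same calls can be practised offline on an inline HTML string. This is a sketch assuming rvest is installed; the HTML snippet is made up to mimic the structure the selectors expect and is not the real IMDb page.

```r
library(rvest)

# Hypothetical page, inlined so the example needs no network access
page <- read_html(paste0(
  "<html><body>",
  "<strong><span>7.8</span></strong>",
  "<table id='titleCast'>",
  "<tr><td class='itemprop'><span>Will Arnett</span></td></tr>",
  "<tr><td class='itemprop'><span>Elizabeth Banks</span></td></tr>",
  "</table>",
  "</body></html>"
))

# A <span> inside a <strong>: the rating
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# <span> elements inside .itemprop cells of the #titleCast table: the cast
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

rating
cast
```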

tidytext for simple text analysis

library(tidytext)
tidy_tweets <- tweets %>% 
  unnest_tokens(word, body) %>% # tokenize body column
  anti_join(stop_words, by = "word") %>% # remove stop words (explicit join column)
  count(word, sort = TRUE) # count by tokens
head(tidy_tweets, n = 5)
## # A tibble: 5 x 2
##        word     n
##       <chr> <int>
## 1 charlotte  5672
## 2        nc  3328
## 3      love  1485
## 4       day  1344
## 5  carolina  1240

https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
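
The same three steps on a few made-up sentences show what each one does. This is a sketch assuming dplyr and tidytext are installed; `toy_tweets` and its text are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the tweets data
toy_tweets <- tibble(
  id   = 1:3,
  body = c("I love Charlotte",
           "Charlotte is the Queen City",
           "Love this day in Charlotte")
)

word_counts <- toy_tweets %>%
  unnest_tokens(word, body) %>%            # one lowercased token per row
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(word, sort = TRUE)                 # most frequent words first
word_counts
```

unnest_tokens() lowercases by default, which is why "Love" and "love" are counted together.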

Other important packages

Future: R Markdown for Replication

This document is an R Markdown presentation (ioslides).

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

For examples on using R Markdown, see http://rmarkdown.rstudio.com/gallery.html.