The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying philosophy ('tidy') and common APIs.
September 27, 2017
For example, which of these two datasets is tidy?
Tuberculosis Cases for Afghanistan, Brazil & China: 1999-2000
## # A tibble: 6 x 4
##       country  year  cases population
##         <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583
## # A tibble: 12 x 4
##        country  year       type      count
##          <chr> <int>      <chr>      <int>
##  1 Afghanistan  1999      cases        745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000      cases       2666
##  4 Afghanistan  2000 population   20595360
##  5      Brazil  1999      cases      37737
##  6      Brazil  1999 population  172006362
##  7      Brazil  2000      cases      80488
##  8      Brazil  2000 population  174504898
##  9       China  1999      cases     212258
## 10       China  1999 population 1272915272
## 11       China  2000      cases     213766
## 12       China  2000 population 1280428583
Replaces the reshape function (R) and/or the "pivot table" function (Excel).
spread
library(tidyr)  # provides spread() and the table2 example data
spread(table2, key = type, value = count)
## # A tibble: 6 x 4
##       country  year  cases population
## *       <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583
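In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping with the newer function, assuming a recent tidyr is installed and using its built-in table2 data:

```r
library(tidyr)  # provides pivot_wider() and the table2 example data

# pivot_wider() performs the same long-to-wide reshaping as spread():
# names_from plays the role of `key`, values_from the role of `value`.
wide <- pivot_wider(table2, names_from = type, values_from = count)
wide
```

Both calls should yield one row per country and year, with cases and population as separate columns.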
Base R first:
Teach syntax such as $ and [[]], loops and conditionals, data types, etc.
Possibly introduce dplyr or data.table as alternatives.
Tidyverse first:
Keep a single-minded focus on data frames and dplyr; introduce ggplot2, tidyr and purrr shortly afterwards.
Introduce the %>% operator; skip syntax like [[]] and $ for later.
David Robinson: http://varianceexplained.org/r/teach-tidyverse/
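To make the contrast concrete, here is a small sketch that answers one question both ways. The built-in mtcars data and the variable names base_mean and tidy_mean are chosen for illustration only; they are not from the original slides.

```r
library(dplyr)

# Base R first: logical indexing with $ and [ ]
base_mean <- mean(mtcars$mpg[mtcars$cyl == 4])

# Tidyverse first: the same question as a pipeline of verbs
tidy_mean <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  pull(mean_mpg)

# Both approaches should agree
all.equal(base_mean, tidy_mean)
```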
Get students doing powerful things quickly.
Don’t teach procedural logic (loops and conditionals)
Teach tables (group by and join).
%>% ("piping") isn’t too hard for beginners
"There’s only one way to do it"
Conclusion: You have permission not to be boring
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
group_by() groups rows by category so that later verbs, such as summarise(), operate within each group.
Load csv (readr) and use dplyr functions.
library(tidyverse)  # loads readr, dplyr, ggplot2, purrr and related packages
tweets <- read_csv("../data/CharlotteTweets20Sample.csv")  # readr function
df <- group_by(tweets, geo.type)
summarise(df, Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591
df <- group_by(tweets, geo.type)
summarise(df, Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591
tweets %>%
  group_by(geo.type) %>%
  summarise(Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##      <chr> <int>
## 1    Point 21683
## 2  Polygon 25591
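The two versions above return the same table because `x %>% f(y)` is evaluated as `f(x, y)`. A small sketch of that equivalence, using the built-in mtcars data as a stand-in for the tweets file (an assumption, since the CSV may not be at hand):

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- summarise(group_by(mtcars, cyl), Count = n())

# Piped calls: read from top to bottom
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(Count = n())

# The two results should be the same table
all.equal(nested, piped)
```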
tweets %>%
  group_by(actor.location.displayName) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(n = 5)
## # A tibble: 5 x 2
##   actor.location.displayName Count
##                        <chr> <int>
## 1                       <NA> 10655
## 2              Charlotte, NC  9270
## 3             Kannapolis, NC   962
## 4           Huntersville, NC   823
## 5             North Carolina   657
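Since the tweets CSV may not be available, the six dplyr verbs can also be tried on data that ships with a package. The sketch below assumes tidyr's built-in table1 (the tuberculosis data shown earlier); the names rates, totals, rate and total_cases are illustrative.

```r
library(dplyr)
library(tidyr)  # used here only for the table1 example data

rates <- table1 %>%
  mutate(rate = cases / population * 10000) %>%  # add a new variable
  filter(year == 2000) %>%                       # pick cases by value
  select(country, rate) %>%                      # pick variables by name
  arrange(desc(rate))                            # reorder the rows

totals <- table1 %>%
  group_by(country) %>%                          # one group per country
  summarise(total_cases = sum(cases))            # one summary row per group

rates
totals
```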
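# df is assumed to hold the top 25 profile locations and their Count,
# e.g. the pipeline above with head(n = 25)
ggplot(df, aes(x = actor.location.displayName, y = Count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Profile Location",
       y = "Number of Tweets",
       title = "Top 25 Profile Locations of Geolocated Clt Tweets")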
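A self-contained sketch of the same kind of chart, using the four named locations from the top-5 table above (the NA row is left out). The data frame top_locations and the use of geom_col() with reorder() are illustrative choices, not the original slide code.

```r
library(ggplot2)

# Counts copied from the top-5 table above; the <NA> row is omitted.
top_locations <- data.frame(
  location = c("Charlotte, NC", "Kannapolis, NC",
               "Huntersville, NC", "North Carolina"),
  Count = c(9270, 962, 823, 657)
)

# geom_col() is shorthand for geom_bar(stat = "identity");
# reorder() sorts the bars by Count instead of alphabetically.
p <- ggplot(top_locations, aes(x = reorder(location, Count), y = Count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Profile Location", y = "Number of Tweets")
p
```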
Let's start with the starwars dataset.
head(starwars, n=5)
## # A tibble: 5 x 4
##             name height  mass species
##            <chr>  <int> <dbl>   <chr>
## 1 Luke Skywalker    172    77   Human
## 2          C-3PO    167    75   Droid
## 3          R2-D2     96    32   Droid
## 4    Darth Vader    202   136   Human
## 5    Leia Organa    150    49   Human
Regression of mass on height, fitted separately for each simplified species (species2)…
starwars %>%
  mutate(species2 = ifelse(species == "Human", species, "non-Human")) %>%
  split(.$species2) %>%                    # from base R
  map(~ lm(mass ~ height, data = .)) %>%   # linear regression
  map(summary) %>%                         # keep the summary output
  map_dbl("r.squared")
##      Human  non-Human
## 0.42465096 0.01932538
lmFit <- starwars %>%
  mutate(species2 = ifelse(species == "Human", species, "non-Human")) %>%
  split(.$species2) %>%                    # from base R
  map(~ lm(mass ~ height, data = .))
library(broom)
tidy(lmFit[[1]])  # regression for human
##          term    estimate  std.error statistic     p.value
## 1 (Intercept) -116.582591 51.9891504 -2.242441 0.036425452
## 2      height    1.110384  0.2890067  3.842071 0.001017536
tidy(lmFit[[2]]) # regression for non-human
##          term  estimate   std.error   statistic   p.value
## 1 (Intercept) -7.941615 145.7080170 -0.05450363 0.9568529
## 2      height  0.677325   0.8274773  0.81854206 0.4187480
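Rather than calling tidy() once per model, the list of fits can be tidied in one pass and stacked into a single data frame. This is a sketch under the assumption that the starwars data from a current dplyr is used, so the estimates may differ from those printed above; the names fits and coefs are illustrative.

```r
library(dplyr)
library(purrr)
library(broom)

# Refit the two models so the example is self-contained
fits <- starwars %>%
  mutate(species2 = ifelse(species == "Human", species, "non-Human")) %>%
  split(.$species2) %>%
  map(~ lm(mass ~ height, data = .))

# Tidy every model, then stack the results; .id records which model
# each row came from (taken from the names of the list)
coefs <- fits %>%
  map(tidy) %>%
  bind_rows(.id = "species2")
coefs
```

The result has one row per model term, which makes it easy to filter or plot coefficients across groups.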
https://cran.r-project.org/web/packages/broom/vignettes/broom.html
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
## [1] 7.8
lego_movie %>% html_nodes("#titleCast .itemprop span") %>% html_text()
##  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"
##  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
##  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"
## [10] "Will Ferrell"    "Will Forte"      "Dave Franco"
## [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
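Live pages change their markup over time, so the selectors above may no longer match the current IMDb page. The sketch below shows how the same CSS selectors work on a small inline HTML string; the markup is hypothetical and written only to mirror the structure the selectors expect.

```r
library(rvest)

# Hypothetical markup, not the real IMDb page
page <- read_html('
  <div><strong><span>7.8</span></strong></div>
  <table id="titleCast">
    <tr><td class="itemprop"><span>Will Arnett</span></td></tr>
    <tr><td class="itemprop"><span>Elizabeth Banks</span></td></tr>
  </table>
')

# "strong span": every <span> nested inside a <strong>
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# "#titleCast .itemprop span": spans inside elements of class itemprop,
# inside the element whose id is titleCast
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
```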
library(tidytext)
tidy_tweets <- tweets %>%
  unnest_tokens(word, body) %>%   # tokenize body column
  anti_join(stop_words) %>%       # remove stop words
  count(word, sort = TRUE)        # count by tokens
head(tidy_tweets, n = 5)
## # A tibble: 5 x 2
##        word     n
##       <chr> <int>
## 1 charlotte  5672
## 2        nc  3328
## 3      love  1485
## 4       day  1344
## 5  carolina  1240
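The same three steps can be tried without the tweets file. The sketch below uses three made-up sentences in a column named body, mirroring the tweets data; the text and the name toy_tweets are illustrative assumptions.

```r
library(dplyr)
library(tidytext)

# Made-up text standing in for the tweets data
toy_tweets <- tibble(
  body = c("I love Charlotte NC",
           "A day in Charlotte",
           "Love the day in Charlotte Carolina")
)

word_counts <- toy_tweets %>%
  unnest_tokens(word, body) %>%            # one lowercase token per row
  anti_join(stop_words, by = "word") %>%   # drop stop words such as "the"
  count(word, sort = TRUE)                 # most frequent tokens first
word_counts
```

Passing `by = "word"` to anti_join() makes the join column explicit and avoids the message that dplyr prints when it has to guess.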
https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
readxl http://readxl.tidyverse.org/
lubridate http://lubridate.tidyverse.org/
stringr http://stringr.tidyverse.org/
forcats http://forcats.tidyverse.org/
This document is an R Markdown presentation (ioslides).
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.
For examples on using R Markdown, see http://rmarkdown.rstudio.com/gallery.html.