July 18, 2018
The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying philosophy (‘tidy’) and common APIs.
Base R first:
Teach syntax such as $ and [[]], loops and conditionals, data types, etc.
Possibly introduce dplyr or data.table as alternatives.
Tidyverse first:
Keep a single-minded focus on data frames and dplyr; introduce ggplot2, tidyr and purrr shortly afterwards.
Introduce the %>% operator; skip syntax like [[]] and $ for later.
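To make the contrast concrete, here is a minimal sketch of the same task written both ways. It uses the built-in mtcars data rather than anything from this talk, and the variable names are illustrative only.

```r
library(dplyr)

# Base R first: `$`, `[` and logical indexing
heavy_base <- mtcars[mtcars$wt > 4, c("mpg", "wt")]
mean_base  <- mean(heavy_base$mpg)

# Tidyverse first: data-frame verbs chained with %>%
mean_tidy <- mtcars %>%
  filter(wt > 4) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  pull(mean_mpg)

# Both approaches compute the same number
all.equal(mean_base, mean_tidy)
```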
David Robinson: http://varianceexplained.org/r/teach-tidyverse/
Get students doing powerful things quickly.
Don’t teach procedural logic (loops and conditionals)
%>% (“piping”) isn’t too hard for beginners
R Projects instead of setwd() each time
“There’s only one way to do it”
Conclusion: You have permission not to be boring
“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham
For example, which of these two datasets is tidy?
Tuberculosis Cases for Afghanistan, Brazil & China: 1999-2000
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
spread()
Replaces reshape() (base R) and/or the pivot table function (Excel).
spread(table2, key = type, value = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
The gather() function does the opposite (i.e., “unpivots”)
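As a rough illustration of the round trip, the sketch below spreads tidyr's built-in table2 and then gathers it back. The object names wide and long are placeholders. In tidyr 1.0.0 and later, pivot_wider() and pivot_longer() are the recommended replacements, although spread() and gather() still work.

```r
library(tidyr)

# Long -> wide: one column per value of `type`
wide <- spread(table2, key = type, value = count)

# Wide -> long again: stack `cases` and `population` back into key/value pairs
long <- gather(wide, key = "type", value = "count", cases, population)

dim(wide)
dim(long)
```

The gathered result has the same rows as table2, though not necessarily in the same order.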
Functions are easiest to use when they are:
Simple: They do one thing, and they do it well
Composable: They can be combined with other functions for multi-step operations
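A small sketch of what "simple and composable" can look like in practice. The three helper functions are hypothetical, written only for this illustration, and they operate on the starwars data that ships with dplyr.

```r
library(dplyr)

# Each helper does one thing (hypothetical names, for illustration only)
keep_humans <- function(df) filter(df, species == "Human")
add_bmi     <- function(df) mutate(df, bmi = mass / (height / 100)^2)
tallest     <- function(df, n = 3) df %>% arrange(desc(height)) %>% head(n)

# Because each takes and returns a data frame, they chain naturally
result <- starwars %>%
  keep_humans() %>%
  add_bmi() %>%
  tallest()

select(result, name, height, mass, bmi)
```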
# to install the tidyverse the first time:
# install.packages("tidyverse")
library(tidyverse)
Use read_csv() (from readr) and select() (from dplyr)…
tweets <- read_csv("../data/CharlotteTweets20Sample.csv") # readr function
samp <- select(tweets, body, geo.type, point_long, point_lat) # view relevant columns
head(samp, n = 5) # view first 5 records
## # A tibble: 5 x 4
##   body                                       geo.type point_long point_lat
##   <chr>                                      <chr>         <dbl>     <dbl>
## 1 Treon to WR is a really good move by Mac,… Polygon        NA        NA
## 2 primus #vsco #vscocam #primus #primussuck… Point          35.2     -80.8
## 3 WOAH!!!!!!                                 Polygon        NA        NA
## 4 clear -> mostly cloudytemperature down…    Point          35.5     -80.6
## 5 Slutty 7 keeps me up to date tho 🤑        Polygon        NA        NA
group_by() and summarise()
tweets <- read_csv("../data/CharlotteTweets20Sample.csv") # readr function
df <- group_by(tweets, geo.type) # group rows by column geo.type
summarise(df, Count = n()) # summarise (count) rows by geo.type
## # A tibble: 2 x 2
##   geo.type Count
##   <chr>    <int>
## 1 Point    21683
## 2 Polygon  25591
df <- group_by(tweets, geo.type)
summarise(df, Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##   <chr>    <int>
## 1 Point    21683
## 2 Polygon  25591
tweets %>% group_by(geo.type) %>% summarise(Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##   <chr>    <int>
## 1 Point    21683
## 2 Polygon  25591
tweets %>%
  group_by(actor.location.displayName) %>% # group by location
  summarise(Count = n()) %>% # count records
  arrange(desc(Count)) %>% # rank in descending order
  head(n = 5) # show first five
## # A tibble: 5 x 2
##   actor.location.displayName Count
##   <chr>                      <int>
## 1 <NA>                       10655
## 2 Charlotte, NC               9270
## 3 Kannapolis, NC               962
## 4 Huntersville, NC             823
## 5 North Carolina               657
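The group, count and rank pattern above is common enough that dplyr offers count() as a shorthand. The sketch below compares the two forms on the built-in starwars data, since the tweets file is not available here; long_form and short_form are illustrative names.

```r
library(dplyr)

# Explicit pipeline: group, count, rank, truncate
long_form <- starwars %>%
  group_by(species) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(n = 5)

# Shorthand: count() groups, tallies and sorts in one call
short_form <- starwars %>%
  count(species, sort = TRUE, name = "Count") %>%
  head(n = 5)

identical(long_form$Count, short_form$Count)
```

In the tweet output above, the top-ranked location is a missing value. Adding a step such as filter(!is.na(actor.location.displayName)) before grouping would drop those rows, if that is what the analysis calls for.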
head(df, n = 5)
## # A tibble: 5 x 2
##   actor.location.displayName Count
##   <fct>                      <int>
## 1 <NA>                       10655
## 2 Charlotte, NC               9270
## 3 Kannapolis, NC               962
## 4 Huntersville, NC             823
## 5 North Carolina               657
ggplot(df, aes(x = actor.location.displayName, y = Count)) + geom_col() # bar chart
ggplot(df, aes(x = actor.location.displayName, y = Count)) + geom_col() + coord_flip() # flip the coordinates
ggplot(df, aes(x = actor.location.displayName, y = Count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Profile Location", y = "Number of Tweets",
       title = "Top 20 Profile Locations of Geolocated Clt Tweets") # modify labels
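The slides do not show how the location column in df became a factor. One plausible approach, sketched here as an assumption rather than the original code, is to reorder the factor levels by Count so the bars appear sorted. The toy tibble reuses the four non-missing rows printed above; top and p are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Toy data taken from the printed output above (the <NA> row is omitted)
top <- tibble(
  location = c("Charlotte, NC", "Kannapolis, NC", "Huntersville, NC", "North Carolina"),
  Count    = c(9270L, 962L, 823L, 657L)
)

# reorder() sets the factor levels in ascending order of Count,
# so after coord_flip() the largest bar ends up at the top
top <- mutate(top, location = reorder(location, Count))

p <- ggplot(top, aes(x = location, y = Count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Profile Location", y = "Number of Tweets")

levels(top$location)
```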
wkdy <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
df2 <- tweets %>%
  mutate(Hour = format(postedTime, format = "%H"),
         Weekday = weekdays(as.Date(postedTime)),
         Date = as.Date(postedTime)) %>%
  group_by(Hour, Weekday, Date, geo.type) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  mutate(Weekday = factor(Weekday, levels = wkdy))
head(df2, n=2)
## # A tibble: 2 x 5
##   Hour  Weekday Date       geo.type Count
##   <chr> <fct>   <date>     <chr>    <int>
## 1 00    Friday  2015-12-04 Point        4
## 2 00    Friday  2015-12-04 Polygon     15
g <- ggplot(df2, aes(x = Hour, y = Count, frame = Weekday, color = geo.type)) +
  geom_boxplot(width = 1) +
  labs(x = "Hour", y = "Number of Tweets",
       title = "Timing of Charlotte Geolocated Tweets by Day:",
       caption = "Tweets from Dec 2015 - Feb 2016") +
  theme(legend.position = c(0.2, 0.8))
gganimate(g, interval = 5) # legacy gganimate API (before v1.0); newer versions use transition_*() and animate()
Let’s start with the starwars dataset.
head(starwars, n=5)
## # A tibble: 5 x 4
##   name           height  mass species
##   <chr>           <int> <dbl> <chr>
## 1 Luke Skywalker    172   77. Human
## 2 C-3PO             167   75. Droid
## 3 R2-D2              96   32. Droid
## 4 Darth Vader       202  136. Human
## 5 Leia Organa       150   49. Human
Regression of mass on height, by simplified species (species2)…
starwars %>%
  mutate(species2 = ifelse(species == "Human", "Human", "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .)) %>% # linear regression
  map(summary) %>% # keep the summary output
  map_dbl("r.squared")
##      Human  non-Human
## 0.42465096 0.01932538
lmFit <- starwars %>%
  mutate(species2 = ifelse(species == "Human", "Human", "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .))
library(broom)
tidy(lmFit[[1]]) # regression for human
##          term    estimate  std.error statistic     p.value
## 1 (Intercept) -116.582591 51.9891504 -2.242441 0.036425452
## 2      height    1.110384  0.2890067  3.842071 0.001017536
tidy(lmFit[[2]]) # regression for non-human
##          term  estimate   std.error   statistic   p.value
## 1 (Intercept) -7.941615 145.7080170 -0.05450363 0.9568529
## 2      height  0.677325   0.8274773  0.81854206 0.4187480
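Rather than calling tidy() on each list element by hand, the fitted models can be tidied in one pass and stacked into a single data frame. This is a sketch of one way to do that, not code from the original slides; coefs and fits are illustrative names.

```r
library(dplyr)
library(purrr)
library(broom)

lmFit <- starwars %>%
  mutate(species2 = ifelse(species == "Human", "Human", "non-Human")) %>%
  split(.$species2) %>%
  map(~ lm(mass ~ height, data = .))

# One row per coefficient, with the list names kept in a `species2` column
coefs <- lmFit %>%
  map(tidy) %>%
  bind_rows(.id = "species2")

# One row per model, including r.squared
fits <- lmFit %>%
  map(glance) %>%
  bind_rows(.id = "species2")

coefs
select(fits, species2, r.squared)
```

Keeping model results in a data frame means they can go straight into dplyr or ggplot2 like any other table.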
https://cran.r-project.org/web/packages/broom/vignettes/broom.html
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
## [1] 7.8
lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
##  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"
##  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
##  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"
## [10] "Will Ferrell"    "Will Forte"      "Dave Franco"
## [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
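Live pages change their markup over time, so the selectors above may no longer match the current IMDb site. To experiment with the same calls offline, the sketch below parses a small hand-written HTML string. The snippet is made up for illustration and only mimics the structure those selectors expect.

```r
library(rvest)

# Hand-written HTML that mimics the structure the selectors target (illustrative only)
page <- read_html('<html><body>
  <strong><span>7.8</span></strong>
  <div id="titleCast">
    <div class="itemprop"><span>Will Arnett</span></div>
    <div class="itemprop"><span>Elizabeth Banks</span></div>
  </div>
</body></html>')

# CSS selector "strong span": any <span> nested inside a <strong>
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# "#titleCast .itemprop span": spans inside class "itemprop" inside id "titleCast"
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

rating
cast
```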
library(tidytext)
tidy_tweets <- tweets %>%
  unnest_tokens(word, body) %>% # tokenize body column
  anti_join(stop_words) %>% # remove stop words
  count(word, sort = TRUE) # count by tokens
head(tidy_tweets, n = 5)
## # A tibble: 5 x 2
##   word          n
##   <chr>     <int>
## 1 charlotte  5672
## 2 nc         3328
## 3 love       1485
## 4 day        1344
## 5 carolina   1240
library(scales) # provides percent_format()
ggplot(groupedTweets, aes(x = Point, y = Polygon)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.1, height = 0.1) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 0.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")
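The slides do not show how groupedTweets was built. The sketch below is one plausible construction, offered as an assumption: word proportions are computed within each geo.type and then spread into one column per type. The toy_tweets tibble is made up for illustration and stands in for the real tweets data.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up stand-in for the tweets data (illustrative only)
toy_tweets <- tibble(
  geo.type = c("Point", "Point", "Polygon", "Polygon"),
  body     = c("love charlotte", "charlotte day", "love carolina", "charlotte nc")
)

groupedTweets <- toy_tweets %>%
  unnest_tokens(word, body) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # remove stop words
  count(geo.type, word) %>%                # word counts per geo.type
  group_by(geo.type) %>%
  mutate(prop = n / sum(n)) %>%            # share of each word within its geo.type
  ungroup() %>%
  select(-n) %>%
  spread(key = geo.type, value = prop)     # columns: word, Point, Polygon

groupedTweets
```

Words that occur in only one group come out as NA in the other column, and ggplot2 drops those rows from the scatter plot with a warning.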
Reading in SPSS, SAS, and Stata files: haven http://haven.tidyverse.org/
Reading in Excel files: readxl http://readxl.tidyverse.org/
Date/time formats: lubridate http://lubridate.tidyverse.org/
Strings: stringr http://stringr.tidyverse.org/
Factors: forcats http://forcats.tidyverse.org/
Package Management: packrat https://rstudio.github.io/packrat/
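For a quick feel of three of these packages, here is a small sketch using lubridate, stringr and forcats. The values are made up, apart from the timestamp date, which matches the first row of the hourly summary shown earlier.

```r
library(lubridate)
library(stringr)
library(forcats)

# lubridate: parse a timestamp and pull out components
posted <- ymd_hms("2015-12-04 00:15:00", tz = "UTC")
hr  <- hour(posted)   # hour of day as a number
dow <- wday(posted)   # day of week as a number (1 = Sunday by default)

# stringr: consistent, vectorised string helpers
in_nc <- str_detect(c("Charlotte, NC", "Atlanta, GA"), "NC")

# forcats: reorder factor levels, here by frequency
geo <- fct_infreq(factor(c("Point", "Polygon", "Polygon")))

hr
dow
in_nc
levels(geo)
```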
This document is an R Markdown presentation (ioslides).
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.
For examples on using R Markdown, see http://rmarkdown.rstudio.com/gallery.html.
Open the 01-russian-trolls.R file: Topic Modeling of Russian Troll Tweets