July 18, 2018

R

RStudio

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science.

All packages share an underlying philosophy (‘tidy’) and common APIs.

https://www.tidyverse.org/

What are R packages?

How to install/run R packages?
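
A minimal sketch of the usual workflow, assuming a standard R installation: install a package once from CRAN, then either attach it for the session with library() or call a single function through its namespace with pkg::fun(). The tools package is used here only because it ships with R; any installed package behaves the same way.

```r
# One-time install from CRAN (commented out so the sketch runs offline).
# install.packages("tidyverse")

# Attach a package for the current session; its functions are then available by name.
library(tools)
attached <- toTitleCase("tidy data")

# Or call one function through its namespace without attaching the package.
namespaced <- tools::toTitleCase("tidy data")

# Both routes reach the same function.
identical(attached, namespaced)
```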

Tidyverse Packages

Why tidyverse?

Two Schools of Thought for Teaching R

Base R first:

  • Teach syntax such as $ and [[]], loops and conditionals, data types, etc.

  • Possibly introduce dplyr or data.table as alternatives.

Tidyverse first:

  • Keep a single-minded focus on data frames and dplyr; introduce ggplot2, tidyr and purrr shortly afterwards.

  • Introduce the %>% operator; skip syntax like [[]] and $ for later.
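
To make the contrast concrete, here is the same subsetting task written both ways. This is an illustrative sketch on the built-in mtcars data, not an example from either school's teaching material.

```r
library(dplyr)

# Base R style: logical indexing with $ and [ ].
base_result <- mtcars[mtcars$cyl == 4, c("mpg", "wt")]

# Tidyverse style: verbs chained with %>%.
tidy_result <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, wt)

# Both approaches return the same values.
all.equal(base_result$mpg, tidy_result$mpg)
```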

David Robinson: http://varianceexplained.org/r/teach-tidyverse/

Teach the Tidyverse to Beginners

  • Get students doing powerful things quickly.

  • Don’t teach procedural logic (loops and conditionals).

    • Teach tables (group by and join).

  • %>% (“piping”) isn’t too hard for beginners.

  • R Projects instead of setwd() each time

  • “There’s only one way to do it”

  • Conclusion: You have permission not to be boring

David’s 2018 RStudio::conf talk on teaching tidyverse

What is tidy?

Wickham, 2014

“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham

tidy data

tidying data: example 1

For example, which of these two datasets is tidy?

Tuberculosis Cases for Afghanistan, Brazil & China: 1999-2000

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

tidying data: example 2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

tidying functions: spread()

tidying with spread()

spread(table2, key = type, value = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

The gather() function does the opposite (i.e., “unpivots”)
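
A small round trip shows the two functions undoing each other. This sketch types the wide table from the example above in by hand (as doubles, for brevity) and assumes tidyr and tibble are installed; the object names wide, long and wide_again are made up for illustration.

```r
library(tibble)
library(tidyr)

# The wide table from the examples above, entered by hand.
wide <- tribble(
  ~country,      ~year, ~cases, ~population,
  "Afghanistan",  1999,    745,    19987071,
  "Afghanistan",  2000,   2666,    20595360,
  "Brazil",       1999,  37737,   172006362,
  "Brazil",       2000,  80488,   174504898,
  "China",        1999, 212258,  1272915272,
  "China",        2000, 213766,  1280428583
)

# gather() stacks the cases and population columns into key/value pairs.
long <- gather(wide, key = "type", value = "count", cases, population)

# spread() reverses the step and returns one column per key.
wide_again <- spread(long, key = type, value = count)
```

Recent tidyr releases offer pivot_longer() and pivot_wider() as successors to gather() and spread(); the older functions still work.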

tidy principle

Functions are easiest to use when they are:

  1. Simple: They do one thing, and they do it well

  2. Composable: They can be combined with other functions for multi-step operations
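
As a rough illustration of both properties, each dplyr verb below does one job and the pipe composes them into a multi-step operation. The built-in mtcars data and the name heavy_by_cyl are stand-ins chosen for this sketch.

```r
library(dplyr)

heavy_by_cyl <- mtcars %>%
  filter(wt > 3) %>%                    # simple: keep the heavier cars
  group_by(cyl) %>%                     # simple: group by cylinder count
  summarise(mean_mpg = mean(mpg)) %>%   # simple: one summary per group
  arrange(desc(mean_mpg))               # simple: sort the result

heavy_by_cyl
```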

Core packages: dplyr, ggplot2, piping (%>%)

dplyr: Data Transformation

Run tidyverse

# to install the tidyverse the first time, run:
# install.packages("tidyverse")

library(tidyverse)

dplyr

Use read_csv() (from readr) and select() (from dplyr)…

tweets <- read_csv("../data/CharlotteTweets20Sample.csv") # readr function

samp <- select(tweets, body, geo.type, point_long, point_lat) # view relevant columns

head(samp, n = 5) # view first 5 records
## # A tibble: 5 x 4
##   body                                       geo.type point_long point_lat
##   <chr>                                      <chr>         <dbl>     <dbl>
## 1 Treon to WR is a really good move by Mac,… Polygon        NA        NA  
## 2 primus #vsco #vscocam #primus #primussuck… Point          35.2     -80.8
## 3 WOAH!!!!!!                                 Polygon        NA        NA  
## 4 clear -&gt; mostly cloudytemperature down… Point          35.5     -80.6
## 5 Slutty 7 keeps me up to date tho 🤑        Polygon        NA        NA

dplyr: group_by() and summarise()

tweets <- read_csv("../data/CharlotteTweets20Sample.csv") # readr function

df <- group_by(tweets, geo.type) # group rows by column geo.type

summarise(df, Count = n()) # summarise (count) rows by geo.type
## # A tibble: 2 x 2
##   geo.type Count
##   <chr>    <int>
## 1 Point    21683
## 2 Polygon  25591

dplyr with piping (%>%)

df <- group_by(tweets, geo.type)
summarise(df, Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##   <chr>    <int>
## 1 Point    21683
## 2 Polygon  25591
tweets %>%
  group_by(geo.type) %>%
  summarise(Count = n())
## # A tibble: 2 x 2
##   geo.type Count
##   <chr>    <int>
## 1 Point    21683
## 2 Polygon  25591
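
The pipe passes its left-hand side as the first argument of the call on its right, which is why the two forms give the same table. A self-contained check, using a tiny made-up stand-in (toy) for the tweets data:

```r
library(dplyr)

# Hypothetical stand-in for the tweets data: five rows, one column.
toy <- tibble(geo.type = c("Point", "Polygon", "Point", "Polygon", "Polygon"))

# Nested calls, read from the inside out.
nested <- summarise(group_by(toy, geo.type), Count = n())

# The same steps piped, read top to bottom.
piped <- toy %>%
  group_by(geo.type) %>%
  summarise(Count = n())

all.equal(nested, piped)
```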

Piping gets really helpful for long commands…

tweets %>% 
  group_by(actor.location.displayName) %>% # group by location
  summarise(Count = n()) %>% # count records
  arrange(desc(Count)) %>% # rank in descending
  head(n = 5) # show first five
## # A tibble: 5 x 2
##   actor.location.displayName Count
##   <chr>                      <int>
## 1 <NA>                       10655
## 2 Charlotte, NC               9270
## 3 Kannapolis, NC               962
## 4 Huntersville, NC             823
## 5 North Carolina               657

ggplot2: Visualizations

head(df, n = 5)
## # A tibble: 5 x 2
##   actor.location.displayName Count
##   <fct>                      <int>
## 1 <NA>                       10655
## 2 Charlotte, NC               9270
## 3 Kannapolis, NC               962
## 4 Huntersville, NC             823
## 5 North Carolina               657

ggplot2: bar chart

ggplot(df, aes(x = actor.location.displayName, y = Count)) + 
  geom_col() # bar chart

ggplot2: bar chart

ggplot(df, aes(x = actor.location.displayName, y = Count)) + 
  geom_col() + 
  coord_flip() # flip the coordinates

ggplot2: Visualizations

ggplot(df, aes(x = actor.location.displayName, y = Count)) + 
  geom_col() + 
  coord_flip() +
  labs(x = "Profile Location", y = "Number of Tweets", 
       title = "Top 20 Profile Locations of Geolocated Clt Tweets") # modify labels

ggplot = layers
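
One way to see the layering is to build a plot one `+` at a time. This sketch uses the built-in mtcars data rather than the tweets, and the axis labels are only illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg))               # data and aesthetic mapping
p <- p + geom_point()                                   # layer 1: raw points
p <- p + geom_smooth(method = "lm", formula = y ~ x)    # layer 2: fitted line
p <- p + labs(x = "Weight (1000 lbs)", y = "Miles per gallon")  # labels

p
```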

ggplot2 cheatsheet

ggplot example: gganimate

wkdy <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

df2 <- tweets %>%
  mutate(Hour = format(postedTime, format = "%H"),
         Weekday = weekdays(as.Date(postedTime)),
         Date = as.Date(postedTime)) %>%
  group_by(Hour, Weekday, Date, geo.type) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  mutate(Weekday = factor(Weekday, levels = wkdy))

head(df2, n=2)
## # A tibble: 2 x 5
##   Hour  Weekday Date       geo.type Count
##   <chr> <fct>   <date>     <chr>    <int>
## 1 00    Friday  2015-12-04 Point        4
## 2 00    Friday  2015-12-04 Polygon     15

ggplot example: gganimate

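# Note: the frame aesthetic and gganimate() belong to the pre-1.0 gganimate API
# in use at the time of this talk; later releases use transition_*() and animate().
g <- ggplot(df2, aes(x = Hour, y = Count, frame = Weekday, color = geo.type)) + 
  geom_boxplot(width = 1) +
  labs(x = "Hour", y = "Number of Tweets", 
       title = "Timing of Charlotte Geolocated Tweets by Day:",
       caption = "Tweets from Dec 2015 - Feb 2016") +
  theme(legend.position = c(0.2,0.8))

gganimate(g, interval = 5)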

ggplot2 extensions

tidyverse advanced packages

purrr: functional programming

Let’s start with the starwars dataset.

head(starwars, n=5)
## # A tibble: 5 x 4
##   name           height  mass species
##   <chr>           <int> <dbl> <chr>  
## 1 Luke Skywalker    172   77. Human  
## 2 C-3PO             167   75. Droid  
## 3 R2-D2              96   32. Droid  
## 4 Darth Vader       202  136. Human  
## 5 Leia Organa       150   49. Human

purrr: functional programming

Regression of mass on height, by simplified species (species2)…

starwars %>%
  mutate(species2 = ifelse(species == "Human", "Human", "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .)) %>% # linear regression
  map(summary) %>% # keep the summary output
  map_dbl("r.squared")
##      Human  non-Human 
## 0.42465096 0.01932538
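
Two purrr shorthands carry that pipeline: `~` builds an anonymous function, and a string passed to a map function extracts the element of that name. A minimal sketch with made-up inputs (the fits list below is hypothetical, not real model output):

```r
library(purrr)

# `~ .x^2` is shorthand for function(x) x^2.
squares <- map_dbl(1:3, ~ .x^2)

# A string extracts the element with that name from each list entry.
fits <- list(
  a = list(r.squared = 0.42),  # made-up values for illustration
  b = list(r.squared = 0.02)
)
r2 <- map_dbl(fits, "r.squared")
```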

http://purrr.tidyverse.org/

broom: analyze model results

lmFit <- starwars %>%
  mutate(species2 = ifelse(species == "Human", "Human", "non-Human")) %>%
  split(.$species2) %>% # from base R
  map(~ lm(mass ~ height, data = .))
library(broom)
tidy(lmFit[[1]]) # regression for human
##          term    estimate  std.error statistic     p.value
## 1 (Intercept) -116.582591 51.9891504 -2.242441 0.036425452
## 2      height    1.110384  0.2890067  3.842071 0.001017536
tidy(lmFit[[2]]) # regression for non-human
##          term  estimate   std.error   statistic   p.value
## 1 (Intercept) -7.941615 145.7080170 -0.05450363 0.9568529
## 2      height  0.677325   0.8274773  0.81854206 0.4187480
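
Because tidy() returns a data frame, the per-group results can be stacked into one table instead of inspected one model at a time. A sketch of that idea on the built-in mtcars data; the grouping by cyl and the names fits and coefs are choices made for this example.

```r
library(dplyr)
library(purrr)
library(broom)

# One linear model per cylinder group.
fits <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x))

# Tidy each model, then stack the coefficient tables, keeping the group label.
coefs <- fits %>%
  map(tidy) %>%
  bind_rows(.id = "cyl")

coefs
```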

https://cran.r-project.org/web/packages/broom/vignettes/broom.html

rvest for webscraping

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
## [1] 7.8

rvest for webscraping

lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
##  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
##  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
##  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
## [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
## [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
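
The same selectors can be tried offline on an HTML string, which helps when a live page changes its markup. The snippet below is made-up HTML written for this sketch, not IMDb's actual page source.

```r
library(rvest)

# Hypothetical markup that mimics the structure the selectors above expect.
page <- read_html('<html><body>
  <strong><span>7.8</span></strong>
  <div id="titleCast">
    <span class="itemprop"><span>Will Arnett</span></span>
    <span class="itemprop"><span>Elizabeth Banks</span></span>
  </div>
</body></html>')

# CSS selector -> matching nodes -> text -> number.
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# Descendant selector: spans inside .itemprop elements inside #titleCast.
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
```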

https://github.com/hadley/rvest

SelectorGadget

tidytext for simple text analysis

library(tidytext)
tidy_tweets <- tweets %>% 
  unnest_tokens(word, body) %>% # tokenize body column
  anti_join(stop_words) %>% # remove stop words
  count(word, sort = TRUE) # count by tokens

head(tidy_tweets, n = 5)
## # A tibble: 5 x 2
##   word          n
##   <chr>     <int>
## 1 charlotte  5672
## 2 nc         3328
## 3 love       1485
## 4 day        1344
## 5 carolina   1240
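
The same three steps on a two-row toy table make each one easier to follow. The sentences in toy_tweets are invented for this sketch; stop_words is the word list that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for the tweets data.
toy_tweets <- tibble(body = c("Charlotte is a great city", "I love Charlotte"))

toy_counts <- toy_tweets %>%
  unnest_tokens(word, body) %>%          # one lowercase token per row
  anti_join(stop_words, by = "word") %>% # drop rows whose word is a stop word
  count(word, sort = TRUE)               # count tokens, most frequent first

toy_counts
```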

More advanced tidytext

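library(scales) # provides percent_format()

# groupedTweets is not built on this slide; it is assumed to hold one row per word
# with that word's frequency in Point and in Polygon tweets
ggplot(groupedTweets, aes(x = Point, y = Polygon)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.1, height = 0.1) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 0.5) +
  scale_x_log10(labels = percent_format()) + scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")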

Other important “tidy” packages

R Markdown for Replication

This document is an R Markdown presentation (ioslides).

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

For examples on using R Markdown, see http://rmarkdown.rstudio.com/gallery.html.

Time to Practice!