Data Manipulation - tidyverse and dplyr

Ben Bellman
August 28, 2018

tidyverse vs. base R

So far, material has been from base packages
Some base functions are still useful, but it's dated
Instead, I want you to learn the tidyverse:
- A foundational suite of packages for data analysis
- A philosphy about programming style and best practices
- People are increasingly making R “tidy”
- sf, tidytext, tidygraph, tidybayes, etc.
- Spearheaded by RStudio development team

Loading packages

We can manage packages through commands
Install them once, but library at the start of every session

install.packages("tidyverse")
library(tidyverse)

The tidyverse Suite

These slides cover:
- tibble : A “modern” version of a data frame
- readr : Tools for importing text data files
- dplyr : Tools for data manipulation and analysis
- some other tidy extras I find useful!
Includes five other core packages, like ggplot2 and stringr

The tidyverse Philosophy

Tidy data is data where:
1. Each variable is in a column.
2. Each observation is a row.
3. Each value is a cell.
Code is written to be human legible, read like a recipe
- Implemented with %>%, the piping operator
Tidy data objects can be analyzed with all tidyverse functions

The Tibble

tibble package introduces a data frame variant: "tbl"
Some useful technical differences
- Good console printing
- No row names
- Does not convert character columns to factors

tbl <- tibble(
  x = 1:4, 
  y = 1, 
  z = x ^ 2 + y
)
tbl

# A tibble: 4 x 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17

The Tibble

Referencing tibbles by column:

tbl$x

[1] 1 2 3 4

tbl[["x"]]

[1] 1 2 3 4

tbl[[1]]

[1] 1 2 3 4

Using readr

Introduces new family of read_ functions
- read_csv(): comma separated (CSV) files
- read_tsv(): tab separated files
- read_delim(): general delimited files
- read_fwf(): fixed width files
- read_table(): tabular files where colums are separated by white-space.
- read_log(): web log files
Functions from readr return data as tibbles

Using readr

Automatically prints class of columns

library(here)
salaries <- read_csv(here("data","white-house-salaries.csv"))

# useful tibble function to preview data
glimpse(salaries)

Observations: 7,108
Variables: 9
$ employee_name <chr> "abraham, yohannes a", "abrams, adam w", "adams,...
$ salary        <dbl> 40000, 65000, 36000, 92000, 42800, 130500, 57000...
$ position      <chr> "legislative assistant and assistant to the hous...
$ year          <int> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, ...
$ status        <chr> "employee", "employee", "employee", "employee", ...
$ party         <chr> "democrat", "democrat", "democrat", "democrat", ...
$ president     <chr> "obama", "obama", "obama", "obama", "obama", "ob...
$ term          <chr> "first", "first", "first", "first", "first", "fi...
$ gender        <chr> "male", "male", "male", "male", "female", "male"...

readr + haven + readxl

The haven package extends the read_ family to data from other statistics software formats
- read_dta()
- read_sas()
- read_spss()
The readxl package loads Excel spreadsheets into R
- read_xlsx()
Must be installed and loaded separately

Starting with dplyr

dplyr is how we can manipulate the data and run calculations
Five main functions or “verbs” that operate on data frames
- select() picks variables based on their names.
- filter() picks cases based on their values.
- arrange() changes the ordering of the rows.
- mutate() adds new variables that are functions of existing variables
- summarise() calculates single values from many rows.
Other important functions and variants included
- e.g. group_by() and ungroup()

Starting with dplyr

Functions in dplyr always take a data frame as the first argment
Subsequent arguments will refer to variables, or define new variables

Practice Data

We will use the White House Salaries dataset to learn these functions.
Nine variables
- Employee name, inferred gender
- Salary
- Position, full/part time status, year
- President, party, term
Link to data.world page for downloading and more info

select() function

select() creates a new tibble, keeping only the desired columns

names(salaries)    # base R, read/change vector of column names

[1] "employee_name" "salary"        "position"      "year"         
[5] "status"        "party"         "president"     "term"         
[9] "gender"

select(salaries, president, party, year)

# A tibble: 7,108 x 3
   president party     year
   <chr>     <chr>    <int>
 1 obama     democrat  2009
 2 obama     democrat  2009
 3 obama     democrat  2009
 4 obama     democrat  2009
 5 obama     democrat  2009
 6 obama     democrat  2009
 7 obama     democrat  2009
 8 obama     democrat  2009
 9 obama     democrat  2009
10 obama     democrat  2009
# ... with 7,098 more rows

select() function

There are other functions that work inside of select()
See: starts_with() or contains() or matches()

select(salaries, starts_with("s"))

# A tibble: 7,108 x 2
   salary status  
    <dbl> <chr>   
 1  40000 employee
 2  65000 employee
 3  36000 employee
 4  92000 employee
 5  42800 employee
 6 130500 employee
 7  57000 employee
 8  54768 employee
 9  55000 employee
10  50000 employee
# ... with 7,098 more rows

rename() function

class: small-code

rename() returns the full data, but changes variable names

rename(salaries, name = employee_name)

# A tibble: 7,108 x 9
   name    salary position        year status party president term  gender
   <chr>    <dbl> <chr>          <int> <chr>  <chr> <chr>     <chr> <chr> 
 1 abraha…  40000 legislative a…  2009 emplo… demo… obama     first male  
 2 abrams…  65000 western regio…  2009 emplo… demo… obama     first male  
 3 adams,…  36000 executive ass…  2009 emplo… demo… obama     first male  
 4 agnew,…  92000 deputy direct…  2009 emplo… demo… obama     first male  
 5 ahrens…  42800 operator        2009 emplo… demo… obama     first female
 6 aldy, … 130500 special assis…  2009 emplo… demo… obama     first male  
 7 alvara…  57000 domestic dire…  2009 emplo… demo… obama     first female
 8 amorsi…  54768 special assis…  2009 emplo… demo… obama     first male  
 9 anders…  55000 executive ass…  2009 emplo… demo… obama     first female
10 anders…  50000 policy assist…  2009 emplo… demo… obama     first male  
# ... with 7,098 more rows

filter() function

Uses logical statements to subset rows, like if-statements in Stata
- Returns rows where statement is true

yr_2013 <- filter(salaries, year == 2013)
unique(yr_2013$year)

[1] 2013

unique(yr_2013$president)

[1] "obama"

Logical operators

Description	Symbol
less than	`<`
less than or equal to	`<=`
greater than	`>`
greater than or equal to	`>=`
exactly equal to	`==`
not equal to	`!=`
Not x	`!x`
x OR y	`x \| y`
x AND y	`x & y`
x within the set of y	`x %in% y`
x is missing	`is.na(x)`

slice() function

slice() lets you subset rows by order in the table
- Similar to using [, ] notation

slice(salaries, 100:105)

# A tibble: 6 x 9
  employee_name  salary position  year status party president term  gender
  <chr>           <dbl> <chr>    <int> <chr>  <chr> <chr>     <chr> <chr> 
1 davies, susan… 130500 special…  2009 emplo… demo… obama     first female
2 davis, christ…  75000 senior …  2009 emplo… demo… obama     first male  
3 de sio, jr, h… 153300 deputy …  2009 emplo… demo… obama     first male  
4 de vos, erica   60000 directo…  2009 emplo… demo… obama     first male  
5 deese, brian c 130500 special…  2009 emplo… demo… obama     first male  
6 deguzman, bri…  41900 operator  2009 emplo… demo… obama     first male

arrange() function

Lets you reorder data rows according to variables
- Returns tibble, where base::order() returns row numbers
- use select() to reorder columns

arrange(salaries, employee_name, year)

# A tibble: 7,108 x 9
   employee_name salary position  year status party president term  gender
   <chr>          <dbl> <chr>    <int> <chr>  <chr> <chr>     <chr> <chr> 
 1 abbot, anita…  55500 ethics …  2001 emplo… repu… bush      first female
 2 abdullah, ha… 105960 policy …  2015 detai… demo… obama     seco… male  
 3 aberger, mar…  42000 press a…  2012 emplo… demo… obama     first female
 4 aberger, mar…  42000 press a…  2013 emplo… demo… obama     first female
 5 abizaid, chr… 156552 advisor…  2016 detai… demo… obama     seco… female
 6 abney, allen…  45000 spokesm…  2003 emplo… repu… bush      first male  
 7 abney, allen…  45000 spokesm…  2004 emplo… repu… bush      first male  
 8 abney, allen…  55000 spokesm…  2005 emplo… repu… bush      seco… male  
 9 abraham, sab…  55000 energy …  2015 emplo… demo… obama     seco… male  
10 abraham, yoh…  40000 legisla…  2009 emplo… demo… obama     first male  
# ... with 7,098 more rows

arrange() function

The desc() function sorts in descending order

arrange(salaries, desc(salary))

# A tibble: 7,108 x 9
   employee_name salary position  year status party president term  gender
   <chr>          <dbl> <chr>    <int> <chr>  <chr> <chr>     <chr> <chr> 
 1 wheeler, set… 225000 senior …  2013 detai… demo… obama     first male  
 2 marcozzi, da… 192934 directo…  2009 detai… demo… obama     first male  
 3 house, mark s 187100 senior …  2017 detai… repu… trump     first male  
 4 hash, michae… 179700 deputy …  2010 detai… demo… obama     first male  
 5 love, timoth… 179700 policy …  2010 detai… demo… obama     first male  
 6 bannon, step… 179700 assista…  2017 emplo… repu… trump     first male  
 7 bossert, tho… 179700 assista…  2017 emplo… repu… trump     first male  
 8 bremberg, an… 179700 assista…  2017 emplo… repu… trump     first male  
 9 conway, kell… 179700 assista…  2017 emplo… repu… trump     first female
10 dearborn, ri… 179700 assista…  2017 emplo… repu… trump     first male  
# ... with 7,098 more rows

mutate() function

Creates new variables based on other

new <- mutate(salaries, in_1000s = salary / 1000)
names(new)

 [1] "employee_name" "salary"        "position"      "year"         
 [5] "status"        "party"         "president"     "term"         
 [9] "gender"        "in_1000s"

summary(new$in_1000s)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   45.00   61.95   76.41  102.00  225.00

transmute() function

transmute() works the same way, but only returns new columns
Useful if you only need the result

transmute(salaries, in_1000s = salary / 1000)

# A tibble: 7,108 x 1
   in_1000s
      <dbl>
 1     40  
 2     65  
 3     36  
 4     92  
 5     42.8
 6    130. 
 7     57  
 8     54.8
 9     55  
10     50  
# ... with 7,098 more rows

if_else() function

Useful for creating dummy variables, applying functions based on a conditional statement

new <- transmute(salaries, democrat = if_else(party == "democrat", 1, 0))
summary(new)

    democrat     
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :1.0000  
 Mean   :0.5262  
 3rd Qu.:1.0000  
 Max.   :1.0000

summarise() function

summarise() returns a new tibble with results of aggregate functions
- Best used with group_by() function
- Adds group metadata to data object, which is removed with ungroup()
- Grouping affects behavior of functions
- summarise() output will have grouping variable as a column

salaries %>%
  group_by(gender) %>%
  summarise(mean_salary = mean(salary))

# A tibble: 2 x 2
  gender mean_salary
  <chr>        <dbl>
1 female      70413.
2 male        82301.

Using the "pipe" operator

The %>% operator is the glue of tidyverse style
- Originally from magrittr package
Readable way to nest functions
Automatically inserts previous object/result into the first argument of the next function
- Creates recipe-like style, each process appears in sequence
- The result being “piped” can also be referenced with the . placeholder

Using the dplyr workflow

Let's practice using these functions together

by_year <- salaries %>%
  group_by(year) %>%
  summarise(mean_salary = mean(salary))
by_year

# A tibble: 16 x 2
    year mean_salary
   <int>       <dbl>
 1  2001      58042.
 2  2003      65794.
 3  2004      65794.
 4  2005      64580.
 5  2006      65928.
 6  2007      69894.
 7  2008      74257.
 8  2009      80384.
 9  2010      82721.
10  2011      81765.
11  2012      80843.
12  2013      82304.
13  2014      82844.
14  2015      84864.
15  2016      84224.
16  2017      94872

Let's create a new column with the length of tenure in years

salaries <- salaries %>%
  rename(name = employee_name) %>%
  group_by(name) %>%
  mutate(tenure = rank(year))

select(salaries, name, tenure, salary)

# A tibble: 7,108 x 3
# Groups:   name [3,321]
   name                 tenure salary
   <chr>                 <dbl>  <dbl>
 1 abraham, yohannes a       1  40000
 2 abrams, adam w            1  65000
 3 adams, ian h              1  36000
 4 agnew, david p            1  92000
 5 ahrens, rebecca a         2  42800
 6 aldy, jr, joseph e        1 130500
 7 alvarado, lissette a      1  57000
 8 amorsingh, lucius l       1  54768
 9 anderson, amanda d        1  55000
10 anderson, charles d       1  50000
# ... with 7,098 more rows

Analyze employees by total tenure in White House

by_tenure <- salaries %>% 
  group_by(name) %>%                      # look by employee
  summarise(tenure = max(tenure)) %>%     # get full tenure length
  group_by(tenure) %>%                    # new grouping by tenure
  summarise(count = n()) %>%
  mutate(pct = count / sum(count) * 100)

by_tenure

# A tibble: 16 x 3
   tenure count     pct
    <dbl> <int>   <dbl>
 1      1  1593 48.0   
 2      2   832 25.1   
 3      3   459 13.8   
 4      4   185  5.57  
 5      5   100  3.01  
 6      6    63  1.90  
 7      7    27  0.813 
 8      8    24  0.723 
 9      9     9  0.271 
10     10     4  0.120 
11     11     5  0.151 
12     12     1  0.0301
13     13     3  0.0903
14     14     3  0.0903
15     15     2  0.0602
16     16    11  0.331

Questions?

Resources
- https://www.tidyverse.org/
- Online book: R for Data Science (http://r4ds.had.co.nz/)
- base R vs. tidyverse code
- RStudio cheatsheets
  - Help menu -> Cheatsheets