Advanced data wrangling

Smithsonian R Workshop: Data management in R
Brian S Evans
Migratory Bird Center

Goals of this lesson

  • Introduction to the tidyverse
  • tidyR and dplyr review
  • Tidy data
  • Data wrangling in the tidyverse
  • Manipulating strings using stringr

I. The tidyverse: Tibbles and pipes oh my!

alt text alt text alt text

install.packages('tidyverse')
install.packages('stringr')
install.packages('lubridate')

I. The tidyverse: Tibbles and pipes oh my!

A typical data frame:

  subject year value
1       A 2016  13.2
2       B 2016  14.6
3       C 2016  27.1
4       A 2017  26.4
5       B 2017  15.2
6       C 2017  31.3

I. The tidyverse: Tibbles and pipes oh my!

Summary data on a typical data frame:

class(dfLong$subject)
[1] "factor"
str(dfLong)
'data.frame':   6 obs. of  3 variables:
 $ subject: Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
 $ year   : chr  "2016" "2016" "2016" "2017" ...
 $ value  : num  13.2 14.6 27.1 26.4 15.2 31.3

I. The tidyverse: Tibbles and pipes oh my!

Summary data on a typical data frame:

nrow(dfLong)
[1] 6
ncol(dfLong)
[1] 3
dim(dfLong)
[1] 6 3

I. The tidyverse: Tibbles and pipes oh my!

Tibbles:

  • Show a reduced number of rows and columns, if necessary
  • Provide the dimensions of the data table
  • Provide the class of fields in a data frame
# A tibble: 6 × 3
  subject  year value
   <fctr> <chr> <dbl>
1       A  2016  13.2
2       B  2016  14.6
3       C  2016  27.1
4       A  2017  26.4
5       B  2017  15.2
6       C  2017  31.3

I. The tidyverse: Tibbles and pipes oh my!

The Pipe operator (%>%) allows you to pass output from an argument on the left to an argument on the right without assigning a name or nesting functions.

For example, we can make use the tbl_df function and a pipe to turn a regular data frame to a tibble:


dataFrame %>%
  tbl_df

I. The tidyverse: Tibbles and pipes oh my!

The Pipe operator (%>%) allows you to pass output from an argument on the left to an argument on the right without assigning a name or nesting functions.

For example, we can make use the tbl_df function and a pipe to turn a regular data frame to a tibble:


dataFrame %>%
  tbl_df

Note the convention to start a new line after each pipe. This is to make your code more readable.

I. The tidyverse: Tibbles and pipes oh my!

We can read a data table into R directly as a tibble using the readr function read_csv.

read_csv(dataFrame.csv)
# A tibble: 6 × 3
  subject  year value
   <fctr> <chr> <dbl>
1       A  2016  13.2
2       B  2016  14.6
3       C  2016  27.1
4       A  2017  26.4
5       B  2017  15.2
6       C  2017  31.3

II. Tidyr review


tidyr provides functions for tidying dirty data, including:

  • gather: Convert wide to long data frame
  • separate: Separate one column to several
  • arrange: Order rows by a column

III. dplyr review


dplyr provides us with key data wrangling functions, including:

  • rename: Rename columns of a data frame
  • filter: Subset rows in a data table using logical conditions
  • select: Subset columns in a data table
  • left_join: Bind data tables by common columns
  • bind_cols: Bind the columns of two data tables together by position
  • bind_rows: Bind the rows of two data tables together by position
  • mutate: Compute and apppend a new column to a data table
  • mutate_all: Apply a computation to all columns in a data table
  • group_by: Group rows of data by some grouping variable
  • summarize: Calculate summary information for groups


IV. Rules of the road for tidy data

  • Tidy data are easy to:
    • Manipulate
    • Summarize
    • Analyze
  • Understanding dirty vs. clean data is key

IV. Rules of the road for tidy data


Dirty:

subject mass2016 mass2017
A 13.2 26.4
B 14.6 15.2
C 27.1 31.3

IV. Rules of the road for tidy data


1) Store all data in long format

Tidy (tidyr::gather):

subject year value
A 2016 13.2
B 2016 14.6
C 2016 27.1
A 2017 26.4
B 2017 15.2
C 2017 31.3

IV. Rules of the road for tidy data


Dirty:

subject year value site canopy
A 2016 13.2 1 13.3
B 2016 14.6 1 13.3
C 2016 27.1 2 26.8
A 2017 26.4 1 13.3
B 2017 15.2 1 13.3
C 2017 31.3 2 26.8

IV. Rules of the road for tidy data

2) One level of observation per table

Tidy (dplyr::select):

subject year value site
A 2016 13.2 1
B 2016 14.6 1
C 2016 27.1 2
A 2017 26.4 1
B 2017 15.2 1
C 2017 31.3 2
site canopy
1 13.3
2 26.8

IV. Rules of the road for tidy data


Dirty:

subject year value
A 2016 13.2
B 2016 14.6
C 2016 27.1
subject year value
A 2017 26.4
B 2017 15.2
C 2017 31.3

IV. Rules of the road for tidy data


3) One table per level of observation

Tidy (dplyr::left_join):

subject year value
A 2016 13.2
B 2016 14.6
C 2016 27.1
A 2017 26.4
B 2017 15.2
C 2017 31.3

IV. Rules of the road for tidy data


Dirty:

subject date year value
A 2016-06-12 2016 13.2
B 2016-06-17 2016 14.6
C 2016-07-01 2016 27.1
A 2017-06-14 2017 26.4
B 2017-06-18 2017 15.2
C 2017-06-29 2017 31.3

IV. Rules of the road for tidy data


4) No derived variables!

Tidy (dplyr::select):

subject date value
A 2016-06-12 13.2
B 2016-06-17 14.6
C 2016-07-01 27.1
A 2017-06-14 26.4
B 2017-06-18 15.2
C 2017-06-29 31.3

IV. Rules of the road for tidy data


Dirty:

subject value sexYear
A 13.2 m2016
B 14.6 f2016
C 27.1 f2016
A 26.4 m2017
B 15.2 f2017
C 31.3 f2017

IV. Rules of the road for tidy data


5) One data value per field

Tidy (tidyr::separate):

subject year sex value
A 2016 m_ 13.2
B 2016 f_ 14.6
C 2016 f_ 27.1
A 2017 m_ 26.4
B 2017 f_ 15.2
C 2017 f_ 31.3

IV. Rules of the road for tidy data


6) Always use the international date standard (ISO 8601)

badDate <- '5/16/2016'

goodDate <- as.Date(badDate, '%m/%d/%Y')

goodDate
[1] "2016-05-16"

IV. Rules of the road for tidy data


6) Always use the international date standard (ISO 8601)

month(goodDate)
[1] 5
year(goodDate)
[1] 2016
yday(goodDate)
[1] 137

Task: Filter our dplyr tibble to records from 2015

IV. Rules of the road for tidy data

  1. Store all data in long format
  2. One level of observation per table
  3. One table per level of observation
  4. No derived variables!
  5. One data value per field
  6. Always use the international date standard

V. Data wrangling in the tidyverse: Rules of wrangling

V. Data wrangling in the tidyverse: Rules of wrangling


1) Start with raw data and DO NOT store intermediate files*

Certainly do not do this:

dfLong2016 <- filter(dfLong, year == 2016)

write_csv(dfLong2016, 'dfLong2016.csv')

dfLong2016A <- filter(read_csv('dfLong2016.csv'), subject == 'A')

write_csv(dfLong2016A, 'dfLong2016A.csv')

V. Data wrangling in the tidyverse: Rules of wrangling


2) DO NOT assign names to intermediate files in a script unless absolutely necessary

Do not do this either:

dfLong2016 <- filter(dfLong, year == 2016)

dfLong2016A <- filter(dfLong2016, subject == 'A')

dfLong2016A

V. Data wrangling in the tidyverse: Rules of wrangling


1) Start with raw data and DO NOT store intermediate files*
2) DO NOT assign names to intermediate files in a script unless absolutely necessary

Do this:

dfLong %>%
  filter(
    year == 2016,
    subject == 'A'
    )

V. Data wrangling in the tidyverse: Rules of wrangling


3) Provide a new line of script for each operation, connected by pipes

Do not do this:

dfLong %>% filter(year == 2016) %>% mutate(state = 'ridiculous')

Do this:

dfLong %>%
  filter(year == 2016) %>%
  mutate(state = 'ridiculous')

V. Data wrangling in the tidyverse: Rules of wrangling


4) Provide adequate commenting above operations

# Filter by year and subject:

dfLong %>%
  filter(
    year == 2016,
    subject == 'A'
    )

V. Data wrangling in the tidyverse: Rules of wrangling


5) Begin scripts with any necessary set-up and arrange scripts in ordered sections
(e.g., set-up, wrangling, analysis, figures)


# Load libraries:

library(tidyverse)
library(stringr)
library(lubridate)

# Get data:

dfLong <- read_csv('dfLong.csv')

V. Data wrangling in the tidyverse: Rules of wrangling


6) Do not use setwd in your script, but do remind yourself of your working directory!

Do not do this this:

setwd('C:/Users/Brian/Desktop/gits/rWorkshop')

Do this:

# my working directory: C:/Users/Brian/Desktop/gits/rWorkshop

V. Data wrangling in the tidyverse: Rules of wrangling


7) If an R package is only used once or twice in a script, directly reference the package rather than loading the whole library

dplyr::filter(dfLong, year == 2016)

V. Data wrangling in the tidyverse: Rules of wrangling


8) Delineate sections of script

#---------------------------------------------------------*
# ---- SET-UP ----
#---------------------------------------------------------*

# Load libraries:

library(tidyverse)
library(stringr)
library(lubridate)

V. Data wrangling in the tidyverse: Rules of wrangling

1) Start with raw data and DO NOT store intermediate files*
2) DO NOT assign names to intermediate files in a script unless absolutely necessary
3) Provide a new line of script for each operation, connected by pipes
4) Provide adequate commenting above operations
5) Begin scripts with any necessary set-up
6) Do not use setwd in your script, but do remind yourself of your working directory!
7) If an R package is only used once or twice in a script, directly reference the package
8) Delineate sections of script

VI. String manipulation in stringr

Manipulating strings is a common data task that stringr has greatly simplified.

Here we'll look at the functions:

  • str_to_upper: Make strings all upper case
  • str_detect: Detect a string within a value
  • str_trim: Remove white space from a string
  • str_replace_all: Replace characters in a string
  • str_sub: Extract part of a string by position

VI. String manipulation in stringr

str_to_upper: Make strings all upper case

sp
 [1] "NOCA" "BCCH" "GRCA" "Sosp" "SOSP" "grca" "sosp" "Grca" "HOWR" "bcch"
[11] "AMRO" "howr" "CARW" "noca" "amro" "NOMO" "Amro" "Carw" "carw" "EAPH"
[21] "RBWO"
sp %>%
  str_to_upper %>%
  unique
 [1] "NOCA" "BCCH" "GRCA" "SOSP" "HOWR" "AMRO" "CARW" "NOMO" "EAPH" "RBWO"

Task: Change all of the species in our dplyr tibble to upper case

VI. String manipulation in stringr

str_detect: Detect a string within a value

str_detect('hello world','hello')
[1] TRUE
str_detect('hello world', 'foo')
[1] FALSE

VI. String manipulation in stringr

str_detect: Detect a string within a value

Using in conjunction with a filter:

country population
Afghanistan 30551674
Albania 3173271
Algeria 39208194
American Samoa 55165
Andorra 79218
Angola 21471618
Anguilla 14300
Antigua and Barbuda 89985
Argentina 41446246
Armenia 2976566

VI. String manipulation in stringr

str_detect: Detect a string within a value

Using in conjunction with a filter:

population %>%
  filter(str_detect(country, 'United'))
# A tibble: 4 × 2
                                               country population
                                                 <chr>      <int>
1                                 United Arab Emirates    9346129
2 United Kingdom of Great Britain and Northern Ireland   63136265
3                          United Republic of Tanzania   49253126
4                             United States of America  320050716

VI. String manipulation in stringr

str_detect: Detect a string within a value

Using in conjunction with an ifelse statement:

population %>%
  mutate(
    country = ifelse(str_detect(country, 'Al'),
    'Shazam!', country)
    )
# A tibble: 5 × 2
         country population
           <chr>      <int>
1    Afghanistan   30551674
2        Shazam!    3173271
3        Shazam!   39208194
4 American Samoa      55165
5        Andorra      79218

Task: Use str_detect to subset the data to bandNumbers that include “\t”

VI. String manipulation in stringr

str_trim: Remove white space from a string

messyString <- c('apples ', ' oranges', 'bananas\t')

str_trim(messyString)
[1] "apples"  "oranges" "bananas"

VI. String manipulation in stringr

str_replace_all: Replace characters in a string

messyString <- c('apples112', ' oranges358', 'bananas13')

str_replace_all(messyString, 'apples', 'helloWorld')
[1] "helloWorld112" " oranges358"   "bananas13"    
str_replace_all(messyString, '[a-z]', '')
[1] "112"  " 358" "13"  
str_replace_all(messyString, '[0-9]', '')
[1] "apples"   " oranges" "bananas" 

VI. String manipulation in stringr

str_sub: Extract part of a string by position

messyString <- c('apples', 'oranges', 'bananas')

str_sub(messyString, start = -3)
[1] "les" "ges" "nas"
str_sub(messyString, start  = 3)
[1] "ples"  "anges" "nanas"
str_sub(messyString, end = -3)
[1] "appl"  "orang" "banan"

VI. String manipulation in stringr

Task: There are lots of problems with the band numbers in the file we've been using. Clean them using stringr!