Smithsonian R Workshop: Data management in R
Brian S Evans
Migratory Bird Center


install.packages('tidyverse')
install.packages('stringr')
install.packages('lubridate')
A typical data frame:
subject year value
1 A 2016 13.2
2 B 2016 14.6
3 C 2016 27.1
4 A 2017 26.4
5 B 2017 15.2
6 C 2017 31.3
Summary data on a typical data frame:
class(dfLong$subject)
[1] "factor"
str(dfLong)
'data.frame': 6 obs. of 3 variables:
$ subject: Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
$ year : chr "2016" "2016" "2016" "2017" ...
$ value : num 13.2 14.6 27.1 26.4 15.2 31.3
Summary data on a typical data frame:
nrow(dfLong)
[1] 6
ncol(dfLong)
[1] 3
dim(dfLong)
[1] 6 3
Tibbles:
# A tibble: 6 × 3
subject year value
<fctr> <chr> <dbl>
1 A 2016 13.2
2 B 2016 14.6
3 C 2016 27.1
4 A 2017 26.4
5 B 2017 15.2
6 C 2017 31.3
The Pipe operator (%>%) allows you to pass output from an argument on the left to an argument on the right without assigning a name or nesting functions.
For example, we can make use the tbl_df function and a pipe to turn a regular data frame to a tibble:
dataFrame %>%
tbl_df
The Pipe operator (%>%) allows you to pass output from an argument on the left to an argument on the right without assigning a name or nesting functions.
For example, we can make use the tbl_df function and a pipe to turn a regular data frame to a tibble:
dataFrame %>%
tbl_df
Note the convention to start a new line after each pipe. This is to make your code more readable.
We can read a data table into R directly as a tibble using the readr function read_csv.
read_csv(dataFrame.csv)
# A tibble: 6 × 3
subject year value
<fctr> <chr> <dbl>
1 A 2016 13.2
2 B 2016 14.6
3 C 2016 27.1
4 A 2017 26.4
5 B 2017 15.2
6 C 2017 31.3
tidyr provides functions for tidying dirty data, including:
dplyr provides us with key data wrangling functions, including:

Dirty:
| subject | mass2016 | mass2017 |
|---|---|---|
| A | 13.2 | 26.4 |
| B | 14.6 | 15.2 |
| C | 27.1 | 31.3 |
1) Store all data in long format
Tidy (tidyr::gather):
| subject | year | value |
|---|---|---|
| A | 2016 | 13.2 |
| B | 2016 | 14.6 |
| C | 2016 | 27.1 |
| A | 2017 | 26.4 |
| B | 2017 | 15.2 |
| C | 2017 | 31.3 |
Dirty:
| subject | year | value | site | canopy |
|---|---|---|---|---|
| A | 2016 | 13.2 | 1 | 13.3 |
| B | 2016 | 14.6 | 1 | 13.3 |
| C | 2016 | 27.1 | 2 | 26.8 |
| A | 2017 | 26.4 | 1 | 13.3 |
| B | 2017 | 15.2 | 1 | 13.3 |
| C | 2017 | 31.3 | 2 | 26.8 |
2) One level of observation per table
Tidy (dplyr::select):
| subject | year | value | site |
|---|---|---|---|
| A | 2016 | 13.2 | 1 |
| B | 2016 | 14.6 | 1 |
| C | 2016 | 27.1 | 2 |
| A | 2017 | 26.4 | 1 |
| B | 2017 | 15.2 | 1 |
| C | 2017 | 31.3 | 2 |
| site | canopy |
|---|---|
| 1 | 13.3 |
| 2 | 26.8 |
Dirty:
| subject | year | value |
|---|---|---|
| A | 2016 | 13.2 |
| B | 2016 | 14.6 |
| C | 2016 | 27.1 |
| subject | year | value |
|---|---|---|
| A | 2017 | 26.4 |
| B | 2017 | 15.2 |
| C | 2017 | 31.3 |
3) One table per level of observation
Tidy (dplyr::left_join):
| subject | year | value |
|---|---|---|
| A | 2016 | 13.2 |
| B | 2016 | 14.6 |
| C | 2016 | 27.1 |
| A | 2017 | 26.4 |
| B | 2017 | 15.2 |
| C | 2017 | 31.3 |
Dirty:
| subject | date | year | value |
|---|---|---|---|
| A | 2016-06-12 | 2016 | 13.2 |
| B | 2016-06-17 | 2016 | 14.6 |
| C | 2016-07-01 | 2016 | 27.1 |
| A | 2017-06-14 | 2017 | 26.4 |
| B | 2017-06-18 | 2017 | 15.2 |
| C | 2017-06-29 | 2017 | 31.3 |
4) No derived variables!
Tidy (dplyr::select):
| subject | date | value |
|---|---|---|
| A | 2016-06-12 | 13.2 |
| B | 2016-06-17 | 14.6 |
| C | 2016-07-01 | 27.1 |
| A | 2017-06-14 | 26.4 |
| B | 2017-06-18 | 15.2 |
| C | 2017-06-29 | 31.3 |
Dirty:
| subject | value | sexYear |
|---|---|---|
| A | 13.2 | m2016 |
| B | 14.6 | f2016 |
| C | 27.1 | f2016 |
| A | 26.4 | m2017 |
| B | 15.2 | f2017 |
| C | 31.3 | f2017 |
5) One data value per field
Tidy (tidyr::separate):
| subject | year | sex | value |
|---|---|---|---|
| A | 2016 | m_ | 13.2 |
| B | 2016 | f_ | 14.6 |
| C | 2016 | f_ | 27.1 |
| A | 2017 | m_ | 26.4 |
| B | 2017 | f_ | 15.2 |
| C | 2017 | f_ | 31.3 |
6) Always use the international date standard (ISO 8601)
badDate <- '5/16/2016'
goodDate <- as.Date(badDate, '%m/%d/%Y')
goodDate
[1] "2016-05-16"
6) Always use the international date standard (ISO 8601)
month(goodDate)
[1] 5
year(goodDate)
[1] 2016
yday(goodDate)
[1] 137
Task: Filter our dplyr tibble to records from 2015


1) Start with raw data and DO NOT store intermediate files*
Certainly do not do this:
dfLong2016 <- filter(dfLong, year == 2016)
write_csv(dfLong2016, 'dfLong2016.csv')
dfLong2016A <- filter(read_csv('dfLong2016.csv'), subject == 'A')
write_csv(dfLong2016A, 'dfLong2016A.csv')
2) DO NOT assign names to intermediate files in a script unless absolutely necessary
Do not do this either:
dfLong2016 <- filter(dfLong, year == 2016)
dfLong2016A <- filter(dfLong2016, subject == 'A')
dfLong2016A
1) Start with raw data and DO NOT store intermediate files*
2) DO NOT assign names to intermediate files in a script unless absolutely necessary
Do this:
dfLong %>%
filter(
year == 2016,
subject == 'A'
)
3) Provide a new line of script for each operation, connected by pipes
Do not do this:
dfLong %>% filter(year == 2016) %>% mutate(state = 'ridiculous')
Do this:
dfLong %>%
filter(year == 2016) %>%
mutate(state = 'ridiculous')
4) Provide adequate commenting above operations
# Filter by year and subject:
dfLong %>%
filter(
year == 2016,
subject == 'A'
)
5) Begin scripts with any necessary set-up and arrange scripts in ordered sections
(e.g., set-up, wrangling, analysis, figures)
# Load libraries:
library(tidyverse)
library(stringr)
library(lubridate)
# Get data:
dfLong <- read_csv('dfLong.csv')
6) Do not use setwd in your script, but do remind yourself of your working directory!
Do not do this this:
setwd('C:/Users/Brian/Desktop/gits/rWorkshop')
Do this:
# my working directory: C:/Users/Brian/Desktop/gits/rWorkshop
7) If an R package is only used once or twice in a script, directly reference the package rather than loading the whole library
dplyr::filter(dfLong, year == 2016)
8) Delineate sections of script
#---------------------------------------------------------*
# ---- SET-UP ----
#---------------------------------------------------------*
# Load libraries:
library(tidyverse)
library(stringr)
library(lubridate)
1) Start with raw data and DO NOT store intermediate files*
2) DO NOT assign names to intermediate files in a script unless absolutely necessary
3) Provide a new line of script for each operation, connected by pipes
4) Provide adequate commenting above operations
5) Begin scripts with any necessary set-up
6) Do not use setwd in your script, but do remind yourself of your working directory!
7) If an R package is only used once or twice in a script, directly reference the package
8) Delineate sections of script

Manipulating strings is a common data task that stringr has greatly simplified.
Here we'll look at the functions:
str_to_upper: Make strings all upper case
sp
[1] "NOCA" "BCCH" "GRCA" "Sosp" "SOSP" "grca" "sosp" "Grca" "HOWR" "bcch"
[11] "AMRO" "howr" "CARW" "noca" "amro" "NOMO" "Amro" "Carw" "carw" "EAPH"
[21] "RBWO"
sp %>%
str_to_upper %>%
unique
[1] "NOCA" "BCCH" "GRCA" "SOSP" "HOWR" "AMRO" "CARW" "NOMO" "EAPH" "RBWO"
Task: Change all of the species in our dplyr tibble to upper case
str_detect: Detect a string within a value
str_detect('hello world','hello')
[1] TRUE
str_detect('hello world', 'foo')
[1] FALSE
str_detect: Detect a string within a value
Using in conjunction with a filter:
| country | population |
|---|---|
| Afghanistan | 30551674 |
| Albania | 3173271 |
| Algeria | 39208194 |
| American Samoa | 55165 |
| Andorra | 79218 |
| Angola | 21471618 |
| Anguilla | 14300 |
| Antigua and Barbuda | 89985 |
| Argentina | 41446246 |
| Armenia | 2976566 |
str_detect: Detect a string within a value
Using in conjunction with a filter:
population %>%
filter(str_detect(country, 'United'))
# A tibble: 4 × 2
country population
<chr> <int>
1 United Arab Emirates 9346129
2 United Kingdom of Great Britain and Northern Ireland 63136265
3 United Republic of Tanzania 49253126
4 United States of America 320050716
str_detect: Detect a string within a value
Using in conjunction with an ifelse statement:
population %>%
mutate(
country = ifelse(str_detect(country, 'Al'),
'Shazam!', country)
)
# A tibble: 5 × 2
country population
<chr> <int>
1 Afghanistan 30551674
2 Shazam! 3173271
3 Shazam! 39208194
4 American Samoa 55165
5 Andorra 79218
Task: Use str_detect to subset the data to bandNumbers that include “\t”
str_trim: Remove white space from a string
messyString <- c('apples ', ' oranges', 'bananas\t')
str_trim(messyString)
[1] "apples" "oranges" "bananas"
str_replace_all: Replace characters in a string
messyString <- c('apples112', ' oranges358', 'bananas13')
str_replace_all(messyString, 'apples', 'helloWorld')
[1] "helloWorld112" " oranges358" "bananas13"
str_replace_all(messyString, '[a-z]', '')
[1] "112" " 358" "13"
str_replace_all(messyString, '[0-9]', '')
[1] "apples" " oranges" "bananas"
str_sub: Extract part of a string by position
messyString <- c('apples', 'oranges', 'bananas')
str_sub(messyString, start = -3)
[1] "les" "ges" "nas"
str_sub(messyString, start = 3)
[1] "ples" "anges" "nanas"
str_sub(messyString, end = -3)
[1] "appl" "orang" "banan"
Task: There are lots of problems with the band numbers in the file we've been using. Clean them using stringr!
