Advanced data wrangling

Goals of this lesson

  • Don't be a dirty data maker!
  • Introduction to the tidyverse
  • Reading in tibbles
  • Manipulating strings using stringr
  • Combining data tables using join and bind
  • Working with dates
  • Selecting fields (review)
  • Subsetting with %in%
  • Summarizing data (review and new)

I. Rules of the road for tidy data

1) Store all data in long format
2) One table for each level of observation
3) Always start with raw data … no derived variables!
4) One data value per field

I. Rules of the road for tidy data

1) Store all data in long format

Wide format data:

subject mass2016 mass2017
A 13.2 26.4
B 14.6 15.2
C 27.1 31.3

Long format data:

subject year value
A 2016 13.2
B 2016 14.6
C 2016 27.1
A 2017 26.4
B 2017 15.2
C 2017 31.3
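
As a minimal sketch of how the wide table could be reshaped into the long table, the example below uses pivot_longer from the tidyr package. The object names wide and long are made up for illustration, and the choice of pivot_longer is an assumption (older tidyr code uses gather for the same job).

```r
library(tidyr)

# Wide-format table from the example above (hypothetical object name)
wide <- data.frame(
  subject  = c("A", "B", "C"),
  mass2016 = c(13.2, 14.6, 27.1),
  mass2017 = c(26.4, 15.2, 31.3)
)

# Stack the two mass columns into a year column and a value column.
# names_prefix strips "mass" so that only the year remains.
long <- pivot_longer(
  wide,
  cols         = c(mass2016, mass2017),
  names_to     = "year",
  names_prefix = "mass",
  values_to    = "value"
)

long
```

The rows come back grouped by subject rather than by year, and year is stored as text, so it may need converting with as.integer before any arithmetic.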

I. Rules of the road for tidy data

2) One table for each level of observation

Don't do this:

subject year value site canopy precip
A 2016 13.2 1 13.3 91.7
B 2016 14.6 1 13.3 91.7
C 2016 27.1 2 26.8 78.1
A 2017 26.4 1 13.3 91.7
B 2017 15.2 1 13.3 91.7
C 2017 31.3 2 26.8 78.1

Do this:

subject year value site
A 2016 13.2 1
B 2016 14.6 1
C 2016 27.1 2
A 2017 26.4 1
B 2017 15.2 1
C 2017 31.3 2

site canopy precip
1 13.3 91.7
2 26.8 78.1
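
Keeping two tables loses nothing, because the site-level fields can be attached again whenever an analysis needs them. The sketch below assumes the dplyr package and uses made-up object names (observations, sites, combined) to rebuild the combined table with left_join.

```r
library(dplyr)

# Subject-level observations (hypothetical object name)
observations <- tibble(
  subject = c("A", "B", "C", "A", "B", "C"),
  year    = c(2016, 2016, 2016, 2017, 2017, 2017),
  value   = c(13.2, 14.6, 27.1, 26.4, 15.2, 31.3),
  site    = c(1, 1, 2, 1, 1, 2)
)

# Site-level measurements, stored once per site
sites <- tibble(
  site   = c(1, 2),
  canopy = c(13.3, 26.8),
  precip = c(91.7, 78.1)
)

# Attach the site information to each observation by the shared key
combined <- left_join(observations, sites, by = "site")

combined
```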

I. Rules of the road for tidy data

3) Always start with raw data … no derived variables!

Don't do this:

subject date year value
A 2016-06-12 2016 13.2
B 2016-06-17 2016 14.6
C 2016-07-01 2016 27.1
A 2017-06-14 2017 26.4
B 2017-06-18 2017 15.2
C 2017-06-29 2017 31.3

Do this:

subject date value
A 2016-06-12 13.2
B 2016-06-17 14.6
C 2016-07-01 27.1
A 2017-06-14 26.4
B 2017-06-18 15.2
C 2017-06-29 31.3
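
The year does not need to be stored because it can be calculated from the date at any time. The sketch below assumes the lubridate package; the object names measurements and years are made up for illustration.

```r
library(lubridate)

# Raw data: only the observed date is stored (hypothetical object name)
measurements <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  date    = c("2016-06-12", "2016-06-17", "2016-07-01",
              "2017-06-14", "2017-06-18", "2017-06-29"),
  value   = c(13.2, 14.6, 27.1, 26.4, 15.2, 31.3),
  stringsAsFactors = FALSE
)

# Convert the text dates to Date objects, then derive the year on demand
measurements$date <- ymd(measurements$date)
years <- year(measurements$date)

years
```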

I. Rules of the road for tidy data

4) One data value per field

Don't do this:

subject value sexYear
A 13.2 m2016
B 14.6 f2016
C 27.1 f2016
A 26.4 m2017
B 15.2 f2017
C 31.3 f2017

Do this:

subject year value sex
A 2016 13.2 m
B 2016 14.6 f
C 2016 27.1 f
A 2017 26.4 m
B 2017 15.2 f
C 2017 31.3 f
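
A field that combines sex and year can be split into two fields by character position. The sketch below is one possible approach, assuming the stringr function str_sub and made-up object names (messy, tidy); it relies on the first character always being the sex code.

```r
library(stringr)

# Table with two values packed into one field (hypothetical object name)
messy <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  value   = c(13.2, 14.6, 27.1, 26.4, 15.2, 31.3),
  sexYear = c("m2016", "f2016", "f2016", "m2017", "f2017", "f2017"),
  stringsAsFactors = FALSE
)

# Character 1 is the sex; characters 2 onward are the year
tidy <- data.frame(
  subject = messy$subject,
  year    = as.integer(str_sub(messy$sexYear, 2)),
  value   = messy$value,
  sex     = str_sub(messy$sexYear, 1, 1),
  stringsAsFactors = FALSE
)

tidy
```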

II. The tidyverse: Tibbles and pipes oh my!

install.packages('tidyverse')
install.packages('stringr')
install.packages('lubridate')
install.packages('RCurl')

II. The tidyverse: Tibbles and pipes oh my!

Tibbles:

  • Show a maximum of 10 rows for long data tables
  • Show a reduced number of columns, if necessary
  • Provide the dimensions of the data table
  • Provide the class of each field

A regular data frame prints without this information:

  subject year value
1       A 2016  13.2
2       B 2016  14.6
3       C 2016  27.1
4       A 2017  26.4
5       B 2017  15.2
6       C 2017  31.3

II. The tidyverse: Tibbles and pipes oh my!

Tibbles:

  • Show a maximum of 10 rows for long data tables
  • Show a reduced number of columns, if necessary
  • Provide the dimensions of the data table
  • Provide the class of each field

The same data printed as a tibble:

# A tibble: 6 × 3
  subject  year value
   <fctr> <chr> <dbl>
1       A  2016  13.2
2       B  2016  14.6
3       C  2016  27.1
4       A  2017  26.4
5       B  2017  15.2
6       C  2017  31.3
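
To see both print styles side by side, the sketch below builds the example table as a regular data frame and converts it with as_tibble from the tibble package. The object names dataFrame and dataTibble are made up for illustration, and the column classes (factor, character, double) are chosen to match the tibble header shown above.

```r
library(tibble)

# Regular data frame with a factor, a character and a numeric field
dataFrame <- data.frame(
  subject = factor(c("A", "B", "C", "A", "B", "C")),
  year    = c("2016", "2016", "2016", "2017", "2017", "2017"),
  value   = c(13.2, 14.6, 27.1, 26.4, 15.2, 31.3),
  stringsAsFactors = FALSE
)

# Prints every row with no dimensions or field classes
print(dataFrame)

# Prints the dimensions and the class of each field
dataTibble <- as_tibble(dataFrame)
print(dataTibble)
```

Recent versions of tibble abbreviate the factor class as fct rather than fctr, so the header may differ slightly from the output shown above.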

II. The tidyverse: Tibbles and pipes oh my!

The pipe operator (%>%) passes the output of one function to the next function as its first argument, so you do not need to assign intermediate names or nest function calls.

For example, we can use the as_tibble function (which replaces the now-deprecated tbl_df) and a pipe to turn a regular data frame into a tibble:

dataFrame %>%
  as_tibble()

Note the convention of starting a new line after each pipe. This makes your code more readable.
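
The benefit of the pipe is easier to see with more than one step. The sketch below compares a nested call with the equivalent piped version, assuming the dplyr package (which provides both %>% and as_tibble). The objects dataFrame, nested and piped are made up for illustration.

```r
library(dplyr)

# Small example data frame (hypothetical object name)
dataFrame <- data.frame(
  subject = c("A", "B", "C"),
  value   = c(13.2, 14.6, 27.1),
  stringsAsFactors = FALSE
)

# Nested version: read from the inside out
nested <- head(as_tibble(dataFrame), 2)

# Piped version: read from top to bottom, one step per line
piped <- dataFrame %>%
  as_tibble() %>%
  head(2)

# Both approaches return the same tibble
identical(nested, piped)
```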

II. The tidyverse: Tibbles and pipes oh my!

We can read a data table into R directly as a tibble using the readr function read_csv. For today's work, we will read in files from GitHub. To do so, we will use the package RCurl to read in the data from the web.

# Load the packages used below:
library(RCurl)
library(tidyverse)

# Base URL of the GitHub repository that holds the data:
gitSite <- 'https://raw.githubusercontent.com/bsevansunc/rWorkshop/master/'

# Download the contents of each file as text. Note that getURL
# returns the file contents, not a URL, despite the object names:

dirtyBirdURL <- getURL(paste0(gitSite, 'dirtyBirdData','.csv'))

dirtyBandingURL <- getURL(paste0(gitSite, 'dirtyBandingData','.csv'))

dirtyResightURL <- getURL(paste0(gitSite, 'dirtyResightData','.csv'))

# Read the downloaded text in as tibbles:

dirtyBird <- read_csv(dirtyBirdURL)

dirtyBanding <- read_csv(dirtyBandingURL)

dirtyResight <- read_csv(dirtyResightURL)
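
To try read_csv without an internet connection, the sketch below reads a few lines of CSV text typed directly into the script. It assumes readr version 2.0 or later, where literal text is wrapped in I() and show_col_types is available. The objects csvText and exampleData are made up for illustration and are not part of the workshop data.

```r
library(readr)

# A few lines of CSV text standing in for a downloaded file
csvText <- "subject,year,value\nA,2016,13.2\nB,2016,14.6\nC,2016,27.1\n"

# I() marks the string as literal data rather than a file path
exampleData <- read_csv(I(csvText), show_col_types = FALSE)

# The result is already a tibble, so no conversion step is needed
class(exampleData)
exampleData
```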