Advanced data wrangling

Goals of this lesson

Don't be a dirty data maker!
Introduction to the tidyverse
Reading in tibbles
Manipulating strings using stringr
Combining data tables using join and bind
Working with dates
Selecting fields (review)
Subsetting with %in%
Summarizing data (review and new)

I. Rules of the road for tidy data

1) Store all data in long format
2) One table for each level of observation
3) Always start with raw data … no derived variables!
4) One data value per field

1) Store all data in long format

Wide format data:

subject	mass2016	mass2017
A	13.2	26.4
B	14.6	15.2
C	27.1	31.3

Long format data:

subject	year	value
A	2016	13.2
B	2016	14.6
C	2016	27.1
A	2017	26.4
B	2017	15.2
C	2017	31.3
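
The conversion from wide to long format can be sketched with tidyr. This is a minimal example, assuming tidyr 1.0.0 or later (for pivot_longer); the object names wide and long are made up for illustration. Rows come out grouped by subject rather than by year, but the content matches the long table above.

```r
library(tidyr)
library(tibble)

# Wide-format table from the example above
wide <- tribble(
  ~subject, ~mass2016, ~mass2017,
  "A", 13.2, 26.4,
  "B", 14.6, 15.2,
  "C", 27.1, 31.3
)

# Stack the two mass columns into year/value pairs,
# dropping the "mass" prefix to recover the year
long <- pivot_longer(
  wide,
  cols = c(mass2016, mass2017),
  names_to = "year",
  names_prefix = "mass",
  values_to = "value"
)

print(long)
```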

2) One table for each level of observation

Don't do this:

subject	year	value	site	canopy	precip
A	2016	13.2	1	13.3	91.7
B	2016	14.6	1	13.3	91.7
C	2016	27.1	2	26.8	78.1
A	2017	26.4	1	13.3	91.7
B	2017	15.2	1	13.3	91.7
C	2017	31.3	2	26.8	78.1

Do this:

subject	year	value	site
A	2016	13.2	1
B	2016	14.6	1
C	2016	27.1	2
A	2017	26.4	1
B	2017	15.2	1
C	2017	31.3	2

site	canopy	precip
1	13.3	91.7
2	26.8	78.1
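
One way to split a combined table into one table per level of observation is sketched below with dplyr. The object names (combined, observations, sites, rebuilt) are hypothetical. The last step shows that nothing is lost: a join rebuilds the combined table whenever it is needed.

```r
library(dplyr)
library(tibble)

# Combined table from the "Don't do this" example
combined <- tribble(
  ~subject, ~year, ~value, ~site, ~canopy, ~precip,
  "A", 2016, 13.2, 1, 13.3, 91.7,
  "B", 2016, 14.6, 1, 13.3, 91.7,
  "C", 2016, 27.1, 2, 26.8, 78.1,
  "A", 2017, 26.4, 1, 13.3, 91.7,
  "B", 2017, 15.2, 1, 13.3, 91.7,
  "C", 2017, 31.3, 2, 26.8, 78.1
)

# Observation-level table: one row per subject per year
observations <- select(combined, subject, year, value, site)

# Site-level table: one row per site
sites <- distinct(select(combined, site, canopy, precip))

# The combined table can be rebuilt from the two tidy tables
rebuilt <- left_join(observations, sites, by = "site")
```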

3) Always start with raw data … no derived variables!

Don't do this:

subject	date	year	value
A	2016-06-12	2016	13.2
B	2016-06-17	2016	14.6
C	2016-07-01	2016	27.1
A	2017-06-14	2017	26.4
B	2017-06-18	2017	15.2
C	2017-06-29	2017	31.3

Do this:

subject	date	value
A	2016-06-12	13.2
B	2016-06-17	14.6
C	2016-07-01	27.1
A	2017-06-14	26.4
B	2017-06-18	15.2
C	2017-06-29	31.3
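
Storing only the raw date costs nothing, because a derived variable such as year can be computed whenever it is needed. A minimal sketch using lubridate follows; the object names raw and withYear are made up for illustration.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Raw table: dates only, no derived year column
raw <- tribble(
  ~subject, ~date, ~value,
  "A", "2016-06-12", 13.2,
  "B", "2016-06-17", 14.6,
  "C", "2016-07-01", 27.1,
  "A", "2017-06-14", 26.4,
  "B", "2017-06-18", 15.2,
  "C", "2017-06-29", 31.3
)

# Parse the date strings, then derive the year on the fly
withYear <- mutate(raw, date = ymd(date), year = year(date))
```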

4) One data value per field

Don't do this:

subject	value	sexYear
A	13.2	m2016
B	14.6	f2016
C	27.1	f2016
A	26.4	m2017
B	15.2	f2017
C	31.3	f2017

Do this:

subject	year	value	sex
A	2016	13.2	m
B	2016	14.6	f
C	2016	27.1	f
A	2017	26.4	m
B	2017	15.2	f
C	2017	31.3	f
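
A field such as sexYear, which packs two values into one, can be split with stringr. This is a sketch that assumes the first character is always the sex code and the remainder is the year; the object names messy and tidy are hypothetical.

```r
library(dplyr)
library(stringr)
library(tibble)

# Table with two values packed into one field
messy <- tribble(
  ~subject, ~value, ~sexYear,
  "A", 13.2, "m2016",
  "B", 14.6, "f2016",
  "C", 27.1, "f2016",
  "A", 26.4, "m2017",
  "B", 15.2, "f2017",
  "C", 31.3, "f2017"
)

# First character is the sex; everything after it is the year
tidy <- mutate(
  messy,
  sex = str_sub(sexYear, 1, 1),
  year = as.integer(str_sub(sexYear, 2))
)

# Keep one value per field and drop the combined column
tidy <- select(tidy, subject, year, value, sex)
```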

II. The tidyverse: Tibbles and pipes oh my!

install.packages('tidyverse')
install.packages('stringr')
install.packages('lubridate')
install.packages('RCurl')

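Tibbles are data frames that print more informatively. When printed, they:

Show a maximum of 10 rows for long data tables
Show a reduced number of columns, if necessary
Provide the dimensions of the data table
Provide the class of each field

For comparison, a regular data frame prints like this: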

  subject year value
1       A 2016  13.2
2       B 2016  14.6
3       C 2016  27.1
4       A 2017  26.4
5       B 2017  15.2
6       C 2017  31.3

# A tibble: 6 × 3
  subject  year value
   <fctr> <chr> <dbl>
1       A  2016  13.2
2       B  2016  14.6
3       C  2016  27.1
4       A  2017  26.4
5       B  2017  15.2
6       C  2017  31.3
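
The two printouts above can be reproduced with a short sketch. The object names dataFrame and dataTibble are illustrative, and the column types are chosen to match the printed classes. The exact type abbreviations shown in the header depend on the installed tibble version.

```r
library(tibble)

# A regular data frame with the same columns as above
dataFrame <- data.frame(
  subject = factor(c("A", "B", "C", "A", "B", "C")),
  year = c("2016", "2016", "2016", "2017", "2017", "2017"),
  value = c(13.2, 14.6, 27.1, 26.4, 15.2, 31.3),
  stringsAsFactors = FALSE
)

# Convert to a tibble; the data are unchanged, only the class differs
dataTibble <- as_tibble(dataFrame)

print(dataFrame)   # plain data frame printout
print(dataTibble)  # adds dimensions and column classes
```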

The pipe operator (%>%) passes the output of one function to the next function as its first argument, so you don't have to assign intermediate names or nest function calls.

For example, we can use the as_tibble function (which replaces the now-deprecated tbl_df) and a pipe to turn a regular data frame into a tibble:

dataFrame %>%
  as_tibble()

Note the convention to start a new line after each pipe. This makes your code more readable.
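
To see what the pipe saves, the sketch below writes the same calculation twice, once with nested calls and once with pipes. The data and the object names (nested, piped, meanValue) are made up for illustration.

```r
library(dplyr)
library(tibble)

dataFrame <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  year = c(2016, 2016, 2016, 2017, 2017, 2017),
  value = c(13.2, 14.6, 27.1, 26.4, 15.2, 31.3)
)

# Nested calls: must be read from the inside out
nested <- summarise(
  group_by(as_tibble(dataFrame), year),
  meanValue = mean(value)
)

# Piped calls: read top to bottom, one step per line
piped <- dataFrame %>%
  as_tibble() %>%
  group_by(year) %>%
  summarise(meanValue = mean(value))
```

Both versions return the same two-row summary; only the readability differs.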

We can read a data table into R directly as a tibble using the readr function read_csv. For today's work, we will read in files from GitHub. To do so, we will use the package RCurl to read in the data from the web.

# Load the packages used below:
library(RCurl)
library(tidyverse)

# Base URL for the workshop files on GitHub:
gitSite <- 'https://raw.githubusercontent.com/bsevansunc/rWorkshop/master/'

# Download each file (getURL returns the file's contents as text, not a URL):

dirtyBirdURL <- getURL(paste0(gitSite, 'dirtyBirdData', '.csv'))

dirtyBandingURL <- getURL(paste0(gitSite, 'dirtyBandingData', '.csv'))

dirtyResightURL <- getURL(paste0(gitSite, 'dirtyResightData', '.csv'))

# Parse the downloaded text into tibbles:

dirtyBird <- read_csv(dirtyBirdURL)

dirtyBanding <- read_csv(dirtyBandingURL)

dirtyResight <- read_csv(dirtyResightURL)
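
The reason this works is that getURL returns the file contents as a single character string, and read_csv can parse CSV text as well as file paths. The offline sketch below uses a made-up two-row CSV string (csvText and birds are hypothetical names, not the workshop files). Wrapping the string in I() marks it explicitly as literal data, which recent readr versions recommend.

```r
library(readr)

# Stand-in for the text that getURL would return
csvText <- "subject,year,value\nA,2016,13.2\nB,2016,14.6\n"

# Parse the literal CSV text into a tibble
birds <- read_csv(I(csvText))

print(birds)
```

read_csv also accepts a path beginning with http:// or https:// and downloads the file itself, so the RCurl step is one option rather than a requirement.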
