logo

1 Special data types


1.1 Strings

Strings are a special type of data. They are always classified as a character. When working with strings it is common to use regular expressions, or regexps. Strings usually contain unstructured or semi-structured data, and regexps are a concise language to describe patterns in strings. We will focus on the stringer package that is included in the core tidyverse.

1.1.1 The basics

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Make a string with "" or ''
(string1 <- "This is a string")
## [1] "This is a string"
(string2 <- 'If I want to include a "quote" inside a string, I use single quotes')
## [1] "If I want to include a \"quote\" inside a string, I use single quotes"

(string3 <- "There are also many special \ncharacters in R, like \u00b5")
## [1] "There are also many special \ncharacters in R, like µ"

# See more with ?'"'

# String length
str_length(c("a", "R for data science", NA))
## [1]  1 18 NA

# Combining strings
str_c("x", "yz")
## [1] "xyz"

# Control how they are seperated
str_c("x", "yz", sep = ", ")
## [1] "x, yz"

# More complex
str_c("prefix-", c("a", "b", "c"), "-suffix")
## [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

# Extract parts of a string
x <- c("Apple", "Banana", "Orange")

str_sub(x, start = 1, end = 3)
## [1] "App" "Ban" "Ora"

str_sub(x, start = -3, end = -1)
## [1] "ple" "ana" "nge"

# You can also use the subest function to alter the text at a specific location
str_to_lower(str_sub(x, 1, 1))
## [1] "a" "b" "o"

Regular expressions are complicated, but can be useful once you understand how to use them. We will use str_view() and str_view_all() to show how a character vector and regular expression match up.

We will start with some basic matching.

# Matching exact letters
x <- c("apple", "banana", "pear")
str_view(x, "an")
# If you want to match any character, you use . (dot)
str_view(x, ".a.")

But what if you want to match "." in a string, such as x.y. We need to use the backslash \ to escape special behavior. Unfortunately, \ also has special behavior, therefore to match to "." in a string, you need to write \\.

# Matching to special behaviors, the dot
x <- c("sats", "a.x", "hat")
str_view(x, "a.")
str_view(x, "a\\.")
str_view(x, "a\\..")
# For a single backslash
x <- "a\\b"
writeLines(x)
## a\b
str_view(x, "\\\\")

Sometimes you might want to match expressions at the beginning or end of a string. ^ will match to the beginning of the string, and $ will match to the end.

# Match to the beginning
x <- c("apple", "banana", "pear")
str_view(x, "^a")
# Match to the end
x <- c("apple", "banana", "pear")
str_view(x, "a$")
# Match a complete string by anchoring the beginning and end 
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")

We have already seen a special character, ., that matches any character. There are 4 more that are useful in R:

  • \d matches any digit
  • \s matches any white space, such as space, tab, newline
  • [abc] matches a, b, OR c
  • [^abc] matches anything except a, b, OR c
x <- c("apple123", "a4_pie3", "ab*cat", "a.c", "a c")

str_view(x, "\\d")
str_view(x, "\\s")
str_view(x, "[abc]")
str_view(x, "[^abc]")
# You can also contain some special characters, like . , inside [] to match them
str_view(x, "a[.]c")
str_view(x, ".[*].")
str_view(x, "a[ ]")

You can also pick between one or more alternative patterns. For example, "app|ear" to either "app" or "ear". You can also use () to make it clear what you want to match.

str_view(c("grey", "gray"), "gr(e|a)y")

The next level up in complexity is controlling how many times a pattern matches. ? will match to 0 or 1, + will match to 1 or more, and * will match to 0 or more.

x <- "MDCCCLXXXVIII"
str_view(x, "CC")
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, "C[LX]")
str_view(x, "C+[LX]")
str_view(x, "C[LX]+")
# You can also match to a specific number of cases
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
# Laxy matching, i.e., the shortest string
str_view(x, "C{2,3}?")
str_view(x, 'C[LX]+?')

Parentheses can also be used to create a numbered capturing group. A capturing group stores the part of the string that was matched by the regular expression inside the parentheses. You can then refer to these groups with back references.

# This will find all fruits that have a pair of repeated letters
# (..) finds all two letter combinations, \\1 is saying look at the two letter
# combinations, and by combining it you are telling R to match to 2 letters AND 
# the back reference
str_view(fruit, "(..)\\1", match = TRUE)

1.1.2 Applying basics to real problems

Now that you have learned the basics of strings and regular expressions, it is time to apply what you have learned to manipulate data by determining which strings match a pattern, find position of matches, extract or replace matches, and split a string based on a match.

It is relatively easy to determine if characters in a vector matches a pattern using str_detect().

x <- c("apple", "banana", "pear")

str_detect(x, "e")
## [1]  TRUE FALSE  TRUE

# You can also combine with sum() or mean() to answer questions about a large vector

# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65

# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306

# You can also negate the function to do the opposite
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")

# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")

# These are the same thing, but the first is way easier
identical(no_vowels_1, no_vowels_2)
## [1] TRUE

# Sometimes you might want to know the number of matches in a string
str_count(x, "a")
## [1] 1 3 1

# Or the average number of vowels per word
mean(str_count(words, "[aeiou]"))
## [1] 1.991837

str_detect() is an easy way to select elements that match a pattern, but an easier way to subset is the str_subset() wrapper:

# Pull out words that end in x
str_subset(words, "x$")
## [1] "box" "sex" "six" "tax"

# When you are working in a data frame you will want to filter instead
df <- data.frame(i = seq_along(words), word = words)

df %>% filter(str_detect(word, "x$"))
##     i word
## 1 108  box
## 2 747  sex
## 3 772  six
## 4 841  tax

You should note that matches never overlap. For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three:

str_count("abababa", "aba")
## [1] 2

str_view_all("abababa", "aba")

To extract the actual text of a match, we use str_extract()

# Look at sentences provided in the stringer package
length(sentences)
## [1] 720
head(sentences)
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."       
## [6] "The juice of lemons makes fine punch."

# If we want to find all sentences that contain a color, we can create a vector
colours <- c("red", "orange", "yellow", "green", "blue", "purple")

# Then turn it into a regexp using str_c()
colour_match <- str_c(colours, collapse = "|")
colour_match
## [1] "red|orange|yellow|green|blue|purple"

# Now we can select sentences that contain a color and extract the color
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
## [1] "blue" "blue" "red"  "red"  "red"  "blue"

# Note that it only extracts the first match. We can get all matches by using 
# str_extract_all()
matches <- str_extract_all(has_colour, colour_match)

We can also worked with grouped strings to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the” (i.e., "a|the ". Defining a “word” in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character "+" that isn’t a space "[^ ]".

# Define our pattern using () to group them
noun <- "(a|the) ([^ ]+)"

# Get the first 10 sentences that have matches
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)

# Extract the matches
has_noun %>% 
  str_extract(noun)
##  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
##  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

# Get each individual component
has_noun %>%
  str_match(noun)
##       [,1]         [,2]  [,3]     
##  [1,] "the smooth" "the" "smooth" 
##  [2,] "the sheet"  "the" "sheet"  
##  [3,] "the depth"  "the" "depth"  
##  [4,] "a chicken"  "a"   "chicken"
##  [5,] "the parked" "the" "parked" 
##  [6,] "the sun"    "the" "sun"    
##  [7,] "the huge"   "the" "huge"   
##  [8,] "the ball"   "the" "ball"   
##  [9,] "the woman"  "the" "woman"  
## [10,] "a helps"    "a"   "helps"

You can also replace matches with new strings using str_replace() and str_replace_all().

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
## [1] "-pple"  "p-ar"   "b-nana"

str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-"  "p--r"   "b-n-n-"


# You can perform multiple replacements by supplying a vector
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house"    "two cars"     "three people"


# You can also use backreferences to insert components of the match. Flip order of 
# second and third words
sentences %>%
  head(5)
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."

sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
## [1] "The canoe birch slid on the smooth planks." 
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."     
## [4] "These a days chicken leg is a rare dish."   
## [5] "Rice often is served in round bowls."

Another useful function is str_split(), which splits a string up into pieces.

# Split a sentence into words
sentences %>%
  head(5) %>%
  str_split(" ")
## [[1]]
## [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
## [8] "planks."
## 
## [[2]]
## [1] "Glue"        "the"         "sheet"       "to"          "the"        
## [6] "dark"        "blue"        "background."
## 
## [[3]]
## [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish."  
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

# This returns a list, with each sentence as a list component. You can return a 
# matrix instead
sentences %>%
  head(5) %>%
  str_split(" ", simplify = TRUE)
##      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
## [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
## [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
## [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
## [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
## [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
##      [,9]   
## [1,] ""     
## [2,] ""     
## [3,] "well."
## [4,] "dish."
## [5,] ""

# You can request a max number of pieces
sentences %>%
  head(5) %>%
  str_split(" ", n = 2, simplify = TRUE)
##      [,1]    [,2]                                    
## [1,] "The"   "birch canoe slid on the smooth planks."
## [2,] "Glue"  "the sheet to the dark blue background."
## [3,] "It's"  "easy to tell the depth of a well."     
## [4,] "These" "days a chicken leg is a rare dish."    
## [5,] "Rice"  "is often served in round bowls."

When dealing with data that has been entered in excel, you will likely come across inconsistent nomenclature. The simplest case of this can be differences in use of upper or lowercase. Regex takes other comments that can help solve some of these problems.

# You can ignore capitalization
bs <- c("banana", "Banana", "BANANA")
str_view_all(bs, "banana")

str_view_all(bs, regex("banana", ignore_case = TRUE))
# You can match a string to file names. dir() lists all files in a directory
dir(pattern = "\\.R$")
## character(0)

1.2 Date and time

With spatial data you will undoubtedly encounter date/time data. These can be a headache to work with, and there are a few different packages to select from. We will be working in lubridate.

There are 3 different types of date/time data: date, time, and date-time (also known as POSIXct). There are a few ways in which you might create date/time data: from strings, from individual date-time components, or from a date/time object.

When you load an excel sheet into R with date/time data, these are often strings. Lubridate has helper functions to parse strings into date/time data. They automatically work out the date format once you specify the order.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(nycflights13)

# Year, Month, Day order
ymd("2017-01-31")
## [1] "2017-01-31"

# Month, Day, Year order
mdy("January 31st, 2017")
## [1] "2017-01-31"

# Day, Month, Year order
dmy("31-Jan-2017")
## [1] "2017-01-31"

# For date-time, add an underscore with h, m, s
ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"

When you work outside of the lubridate package, you may need to specify your date-time format. The format is constructed from the following pieces: Year %Y (4 digits). %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999. Month %m (2 digits). %b (abbreviated name, like “Jan”). %B (full name, “January”). Day %d (2 digits). %e (optional leading space). Time %H 0-23 hour. %I 0-12, must be used with %p. %p AM/PM indicator. %M minutes. %S integer seconds. %OS real seconds. %Z Time zone (as name, e.g. America/Chicago).

parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"

parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"

parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"

Instead of a string, you might have individual components spread across multiple columns.

flights %>%
  select(year, month, day, hour, minute)
## # A tibble: 336,776 x 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # ... with 336,766 more rows

To create a date/time from this, we use make_date() for dates, or make_datetime() for date-times.

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ... with 336,766 more rows

You may also want to switch between dates and date-times.

today()
## [1] "2021-01-26"
as_datetime(today())
## [1] "2021-01-26 UTC"

now()
## [1] "2021-01-26 09:57:54 CST"
as_date(now())
## [1] "2021-01-26"

Once you have date/time data in R, you can pull out individual parts with assessor functions like year() or wday() (day of week).

datetime <- ymd_hms("2016-07-08 12:34:56")

year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8

yday(datetime)
## [1] 190
wday(datetime)
## [1] 6
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

# You can also set the components with these functions
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"

month(datetime) <- 01
datetime
## [1] "2020-01-08 12:34:56 UTC"

hour(datetime) <- hour(datetime) + 1
datetime
## [1] "2020-01-08 13:34:56 UTC"

# More efficiently, use update() to set multiple values at once
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "2020-02-02 02:34:56 UTC"

Sometimes you may need to round a date to a nearby unit of time.

head(flights)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

# Make a new DF to hold our date-time data
flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  dplyr::mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  dplyr::select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt %>% head()
## # A tibble: 6 x 9
##   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
## 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
## 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
## 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
## 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
## 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
## 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
## # ... with 3 more variables: arr_time <dttm>, sched_arr_time <dttm>,
## #   air_time <dbl>

flights_dt %>% 
  count(weekf = floor_date(dep_time, "week"), weekc = ceiling_date(dep_time, "week")) %>%
  mutate(diffweek = weekc - weekf)
## # A tibble: 56 x 4
##    weekf               weekc                   n diffweek   
##    <dttm>              <dttm>              <int> <drtn>     
##  1 2012-12-30 00:00:00 2013-01-06 00:00:00  4300 604800 secs
##  2 2013-01-06 00:00:00 2013-01-13 00:00:00  6082 604800 secs
##  3 2013-01-13 00:00:00 2013-01-20 00:00:00  5976 604800 secs
##  4 2013-01-20 00:00:00 2013-01-27 00:00:00  5925 604800 secs
##  5 2013-01-27 00:00:00 2013-02-03 00:00:00  5774 604800 secs
##  6 2013-02-03 00:00:00 2013-02-10 00:00:00  5160 604800 secs
##  7 2013-02-10 00:00:00 2013-02-17 00:00:00  6081 604800 secs
##  8 2013-02-17 00:00:00 2013-02-24 00:00:00  6253 604800 secs
##  9 2013-02-24 00:00:00 2013-03-03 00:00:00  6275 604800 secs
## 10 2013-03-03 00:00:00 2013-03-10 00:00:00  6027 604800 secs
## # ... with 46 more rows

The next thing we need to understand and work with are time spans. Often you may need to figure out how many days, weeks, hours, etc. have passed, or you may want to filter out dates that do not fall within a certain period or interval.

The duration of a time span represents an exact number of seconds. When you subtract two dates in R, you get a difftime object.

# How old am I today? 
(my_age <- today() - ymd(19881122))
## Time difference of 11753 days
str(my_age)
##  'difftime' num 11753
##  - attr(*, "units")= chr "days"

# Lubradate always reports duration in seconds
as.duration(my_age)
## [1] "1015459200s (~32.18 years)"

# Create a duration in seconds
dseconds(15)
## [1] "15s"

# In minutes...
dminutes(10)
## [1] "600s (~10 minutes)"

# All the way up to years
dyears(c(1,2))
## [1] "31557600s (~1 years)" "63115200s (~2 years)"

# Use basic math with durations
2 * dyears(1)
## [1] "63115200s (~2 years)"
today() + ddays(10)
## [1] "2021-02-05"

The problem is because duration represent an exact number of seconds, sometimes you can get unexpected results. Because of daylight savings time some days only have 23 hours or have up to 25 hours. Instead, you can use periods in lubridate. Periods are spans of time that don't have a fixed length in seconds. Instead, we can work in days, months, or years.

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"

# Now work in periods
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"

# Again, you can create periods
seconds(12)
## [1] "12S"

# All the way up to years
years(3)
## [1] "3y 0m 0d 0H 0M 0S"

# You can add and multiply
10 * (months(6) + days(1))
## [1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
## [1] "50d 25H 2M 0S"

The last time span you need to understand is an interval. What should years(1) / days(1) return? It will either return 365 or 366, depending on if it is a leap year.

years(1) / days(1)
## [1] 365.25

If you want a more accurate measurement, you will have to use an interval and set the starting point. Intervals in lubridate are denoted by start %--% end.

next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 365

# Another option
interval(start = today(), end = next_year) / ddays(1)
## [1] 365

# We can set an interval
(my_interval <- interval(start = today(), end = next_year))
## [1] 2021-01-26 UTC--2022-01-26 UTC

# We can flip it
int_flip(my_interval)
## [1] 2022-01-26 UTC--2021-01-26 UTC

# Shift it
int_shift(my_interval, ddays(20))
## [1] 2021-02-15 UTC--2022-02-15 UTC

# Find out how many days in each interval using integer division
(today() %--% next_year) %/% days(1)
## [1] 365

# You can also take a vector of date-time and get the interval between each element
head(flights_dt$dep_time)
## [1] "2013-01-01 05:17:00 UTC" "2013-01-01 05:33:00 UTC"
## [3] "2013-01-01 05:42:00 UTC" "2013-01-01 05:44:00 UTC"
## [5] "2013-01-01 05:54:00 UTC" "2013-01-01 05:54:00 UTC"
head(int_diff(flights_dt$dep_time))
## [1] 2013-01-01 05:17:00 UTC--2013-01-01 05:33:00 UTC
## [2] 2013-01-01 05:33:00 UTC--2013-01-01 05:42:00 UTC
## [3] 2013-01-01 05:42:00 UTC--2013-01-01 05:44:00 UTC
## [4] 2013-01-01 05:44:00 UTC--2013-01-01 05:54:00 UTC
## [5] 2013-01-01 05:54:00 UTC--2013-01-01 05:54:00 UTC
## [6] 2013-01-01 05:54:00 UTC--2013-01-01 05:55:00 UTC

The last piece of information we need about Date/time data is time zones. R uses IANA time zones which use a consistent naming scheme. You can find out what timezone you are currently in using Sys.timezone(). You can change what time zone is displayed or what time zone is used.

(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
## [1] "2015-06-01 12:00:00 EDT"
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
## [1] "2015-06-01 18:00:00 CEST"
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
## [1] "2015-06-02 04:00:00 NZST"
x4 <- c(x1, x2, x3)

# Change what is displayed
x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
x4a
## [1] "2015-06-02 02:30:00 +1030" "2015-06-02 02:30:00 +1030"
## [3] "2015-06-02 02:30:00 +1030"

x4a - x4
## Time differences in secs
## [1] 0 0 0


# Change the underlying time
x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
x4b
## [1] "2015-06-01 12:00:00 +1030" "2015-06-01 12:00:00 +1030"
## [3] "2015-06-01 12:00:00 +1030"

x4b - x4
## Time differences in hours
## [1] -14.5 -14.5 -14.5

2 Reshaping data


Once you have imported data it is sometimes necessary to arrange it in a certain order or reshape it from long to wide format. Although there are base functions in R that can arrange variables, we will be working exclusively with dplyr and tidyr.

arrange() orders observations according to the values for a specific variable. The default is to order numbers from lowest to highest.

library(dplyr)
library(tidyverse)

# First we will look at our example database 
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...


# We will order our data by year and highway mpg (hwy)
arrange(mpg, year, hwy)
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 chevrolet    k1500 tah~   5.7  1999     8 auto~ 4        11    15 r     suv  
##  2 dodge        dakota pi~   5.2  1999     8 auto~ 4        11    15 r     pick~
##  3 dodge        durango 4~   5.9  1999     8 auto~ 4        11    15 r     suv  
##  4 dodge        ram 1500 ~   5.2  1999     8 auto~ 4        11    15 r     pick~
##  5 dodge        ram 1500 ~   5.9  1999     8 auto~ 4        11    15 r     pick~
##  6 ford         f150 pick~   5.4  1999     8 auto~ 4        11    15 r     pick~
##  7 land rover   range rov~   4    1999     8 auto~ 4        11    15 p     suv  
##  8 land rover   range rov~   4.6  1999     8 auto~ 4        11    15 p     suv  
##  9 toyota       land crui~   4.7  1999     8 auto~ 4        11    15 r     suv  
## 10 dodge        durango 4~   5.2  1999     8 auto~ 4        11    16 r     suv  
## # ... with 224 more rows

# This is the same as above
mpg %>% arrange(year, hwy)
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 chevrolet    k1500 tah~   5.7  1999     8 auto~ 4        11    15 r     suv  
##  2 dodge        dakota pi~   5.2  1999     8 auto~ 4        11    15 r     pick~
##  3 dodge        durango 4~   5.9  1999     8 auto~ 4        11    15 r     suv  
##  4 dodge        ram 1500 ~   5.2  1999     8 auto~ 4        11    15 r     pick~
##  5 dodge        ram 1500 ~   5.9  1999     8 auto~ 4        11    15 r     pick~
##  6 ford         f150 pick~   5.4  1999     8 auto~ 4        11    15 r     pick~
##  7 land rover   range rov~   4    1999     8 auto~ 4        11    15 p     suv  
##  8 land rover   range rov~   4.6  1999     8 auto~ 4        11    15 p     suv  
##  9 toyota       land crui~   4.7  1999     8 auto~ 4        11    15 r     suv  
## 10 dodge        durango 4~   5.2  1999     8 auto~ 4        11    16 r     suv  
## # ... with 224 more rows

# Notice this did not change our actual data, that is because we did not save it
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

# If we want our new order to stay, we can save it as such
mpg <- mpg %>% arrange(year, hwy)
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 chevrolet    k1500 tah~   5.7  1999     8 auto(~ 4        11    15 r     suv  
## 2 dodge        dakota pi~   5.2  1999     8 auto(~ 4        11    15 r     pick~
## 3 dodge        durango 4~   5.9  1999     8 auto(~ 4        11    15 r     suv  
## 4 dodge        ram 1500 ~   5.2  1999     8 auto(~ 4        11    15 r     pick~
## 5 dodge        ram 1500 ~   5.9  1999     8 auto(~ 4        11    15 r     pick~
## 6 ford         f150 pick~   5.4  1999     8 auto(~ 4        11    15 r     pick~


# What about characters? Let's look 
mpg %>% arrange(manufacturer, model)
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         2.8  1999     6 auto(l~ f        16    26 p     comp~
##  2 audi         a4         2.8  1999     6 manual~ f        18    26 p     comp~
##  3 audi         a4         1.8  1999     4 auto(l~ f        18    29 p     comp~
##  4 audi         a4         1.8  1999     4 manual~ f        21    29 p     comp~
##  5 audi         a4         3.1  2008     6 auto(a~ f        18    27 p     comp~
##  6 audi         a4         2    2008     4 auto(a~ f        21    30 p     comp~
##  7 audi         a4         2    2008     4 manual~ f        20    31 p     comp~
##  8 audi         a4 quat~   1.8  1999     4 auto(l~ 4        16    25 p     comp~
##  9 audi         a4 quat~   2.8  1999     6 auto(l~ 4        15    25 p     comp~
## 10 audi         a4 quat~   2.8  1999     6 manual~ 4        17    25 p     comp~
## # ... with 224 more rows

# We can also reverse the order with desc()
mpg %>% arrange(manufacturer, desc(model))
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a6 quat~   2.8  1999     6 auto(l~ 4        15    24 p     mids~
##  2 audi         a6 quat~   4.2  2008     8 auto(s~ 4        16    23 p     mids~
##  3 audi         a6 quat~   3.1  2008     6 auto(s~ 4        17    25 p     mids~
##  4 audi         a4 quat~   1.8  1999     4 auto(l~ 4        16    25 p     comp~
##  5 audi         a4 quat~   2.8  1999     6 auto(l~ 4        15    25 p     comp~
##  6 audi         a4 quat~   2.8  1999     6 manual~ 4        17    25 p     comp~
##  7 audi         a4 quat~   1.8  1999     4 manual~ 4        18    26 p     comp~
##  8 audi         a4 quat~   3.1  2008     6 auto(s~ 4        17    25 p     comp~
##  9 audi         a4 quat~   3.1  2008     6 manual~ 4        15    25 p     comp~
## 10 audi         a4 quat~   2    2008     4 auto(s~ 4        19    27 p     comp~
## # ... with 224 more rows

Reshaping data is another important concept to grasp to be able to work effectively in R. Data can be organized in different ways, and some can be easier to use than others. Each data set below contains the same data and 4 variables (country, year, population size, # cases), but organized differently.

# All 4 variables have thier own column
table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
# The variables for population size and number of cases are now expressed as an 
# observation
table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
# The number of cases and population size are now combined into one variable, rate
table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

# The data are spread across two tibbles, one for # cases and the other for population
# size
table4a  # cases
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

table4b  # population
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

Hadley Whickham described 3 rules to a tidy data set:

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

So which table above is the tidy one?

What do you do with an untidy data set? Fortunately, the tidyverse has many built in functions to help you tidy up your data set. The first step is to figure out what the variables and observations are. Then reshaping the data is simple with pivot_longer() and pivot_wider.

A common problem in some data sets is that the column name is actually the value of a variable. table4a, for example, has the column names that represent values of the variable year. Each row also contains 2 observations, not one.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

We can fix this by pivoting the columns into a new pair of variables. To do this, we need the columns to be pivoted, the name of the new variable to move the column names to, and the name of the variable that will hold the values from the columns.

table4a %>% 
  pivot_longer(cols = c('1999', '2000'), names_to = "Year", values_to = "Cases")
## # A tibble: 6 x 3
##   country     Year   Cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

This makes the new data set longer. You have added more rows. pivot_wider() is the opposite, it will make a wider data set. You use it when an observation is scattered across multiple rows. In table2 a single observation is a country for each year, but each observation is spread across two rows. The country and year combination (i.e., Brazil 1999) is repeated twice.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

To tidy this up, we need the column to take the new variable names from and the column to take the values from.

table2 %>%
  pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

These are very simple cases, but pivot_longer() and pivot_wider() are easy to use even in complex situations.

Lets take a look at a hypothetical encounter history for nest surveys of wild turkey. Lets say we have 25 nests we were following, and each day we visited the nest we recorded whether or not the nest was still active (fate = 1) or had failed (fate = 0). We also took some vegetation measurements at each nest, and recorded the age of the nesting female. What are our variables? We need nest ID to keep track of the 25 nests, fate of the nest, and day. We also have age of the nesting female, max vegetation height, and ground cover. Each nest and day has an observation, but the veg metrics don't change. Let's be real, we probably did not collect and store our data in a tidy format. That is because tidy data isn't necessarily the easiest to enter the data. I created the hypothetical data in excel, loaded it into R, then used dput() to get code to recreate the data.

# Our hypothetical data
nest_data <- structure(list(Nest_ID = 1:25, veg_max = c(14L, 36L, 37L, 83L, 
63L, 42L, 173L, 217L, 135L, 15L, 35L, 7L, 67L, 3L, 37L, 3L, 73L, 
135L, 73L, 84L, 8L, 9L, 89L, 157L, 162L), gc = c(0.23, 0.86, 
0.1, 0.4, 0.2, 0.52, 0.84, 0.26, 0.56, 0.37, 0.25, 0.84, 0.36, 
0.85, 0.25, 0.64, 0.37, 0.26, 0.36, 0.72, 0.54, 0.35, 0.25, 0.74, 
0.63), age = c("adult", "adult", "adult", "adult", "adult", "adult", 
"juvenile", "juvenile", "adult", "adult", "juvenile", "juvenile", 
"adult", "adult", "adult", "adult", "juvenile", "adult", "adult", 
"juvenile", "juvenile", "adult", "adult", "adult", "juvenile"
), Day1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Day2 = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Day3 = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 0L, 1L, 1L), Day4 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L
), Day5 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), Day6 = c(1L, 
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), Day7 = c(1L, 1L, 1L, 0L, 1L, 
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 0L, 1L, 1L), Day8 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L
), Day9 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day10 = c(1L, 
1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day11 = c(1L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 1L), Day12 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day13 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day14 = c(1L, 
1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day15 = c(1L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 1L), Day16 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day17 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 
1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day18 = c(1L, 
0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day19 = c(1L, 0L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 1L), Day20 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day21 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 
0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day22 = c(1L, 
0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day23 = c(1L, 0L, 1L, 0L, 1L, 
0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 1L), Day24 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day25 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 
0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day26 = c(1L, 
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L), Day27 = c(1L, 0L, 1L, 0L, 1L, 
0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 0L, 1L, 0L), Day28 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L
)), class = "data.frame", row.names = c(NA, -25L))

Now we can look at the data. We know that it is not tidy. The column names are actually the values of the variable, day. We can fix that with pivot_longer.

head(nest_data)
##   Nest_ID veg_max   gc   age Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1       1      14 0.23 adult    1    1    1    1    1    1    1    1    1     1
## 2       2      36 0.86 adult    1    1    1    1    1    1    1    1    1     1
## 3       3      37 0.10 adult    1    1    1    1    1    1    1    1    1     1
## 4       4      83 0.40 adult    1    1    1    1    0    0    0    0    0     0
## 5       5      63 0.20 adult    1    1    1    1    1    1    1    1    1     1
## 6       6      42 0.52 adult    1    1    1    1    1    1    1    1    1     1
##   Day11 Day12 Day13 Day14 Day15 Day16 Day17 Day18 Day19 Day20 Day21 Day22 Day23
## 1     1     1     1     1     1     1     1     1     1     1     1     1     1
## 2     1     1     1     1     1     1     0     0     0     0     0     0     0
## 3     1     1     1     1     1     1     1     1     1     1     1     1     1
## 4     0     0     0     0     0     0     0     0     0     0     0     0     0
## 5     1     1     1     1     1     1     1     1     1     1     1     1     1
## 6     1     1     1     1     1     1     1     1     1     1     1     1     0
##   Day24 Day25 Day26 Day27 Day28
## 1     1     1     1     1     1
## 2     0     0     0     0     0
## 3     1     1     1     1     1
## 4     0     0     0     0     0
## 5     1     1     1     1     1
## 6     0     0     0     0     0
str(nest_data)
## 'data.frame':    25 obs. of  32 variables:
##  $ Nest_ID: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ veg_max: int  14 36 37 83 63 42 173 217 135 15 ...
##  $ gc     : num  0.23 0.86 0.1 0.4 0.2 0.52 0.84 0.26 0.56 0.37 ...
##  $ age    : chr  "adult" "adult" "adult" "adult" ...
##  $ Day1   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day2   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day3   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day4   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day5   : int  1 1 1 0 1 1 1 1 1 1 ...
##  $ Day6   : int  1 1 1 0 1 1 1 1 1 1 ...
##  $ Day7   : int  1 1 1 0 1 1 1 1 1 1 ...
##  $ Day8   : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day9   : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day10  : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day11  : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day12  : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day13  : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day14  : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day15  : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day16  : int  1 1 1 0 1 1 1 0 1 1 ...
##  $ Day17  : int  1 0 1 0 1 1 1 0 1 1 ...
##  $ Day18  : int  1 0 1 0 1 1 1 0 1 1 ...
##  $ Day19  : int  1 0 1 0 1 1 1 0 1 1 ...
##  $ Day20  : int  1 0 1 0 1 1 1 0 1 1 ...
##  $ Day21  : int  1 0 1 0 1 1 1 0 1 1 ...
##  $ Day22  : int  1 0 1 0 1 1 1 0 1 1 ...
##  $ Day23  : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ Day24  : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ Day25  : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ Day26  : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ Day27  : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ Day28  : int  1 0 1 0 1 0 1 0 1 1 ...

nest_data %>%
  pivot_longer(cols = 5:32, names_to = "day", values_to = "fate", 
               names_prefix = "Day")
## # A tibble: 700 x 6
##    Nest_ID veg_max    gc age   day    fate
##      <int>   <int> <dbl> <chr> <chr> <int>
##  1       1      14  0.23 adult 1         1
##  2       1      14  0.23 adult 2         1
##  3       1      14  0.23 adult 3         1
##  4       1      14  0.23 adult 4         1
##  5       1      14  0.23 adult 5         1
##  6       1      14  0.23 adult 6         1
##  7       1      14  0.23 adult 7         1
##  8       1      14  0.23 adult 8         1
##  9       1      14  0.23 adult 9         1
## 10       1      14  0.23 adult 10        1
## # ... with 690 more rows

You see that our data set is indeed longer. It duplicated the values for veg_max and others that did not change for each nest. We also used names_prefix to remove the word Day in front of each day number. But what if we measured veg_max for each visit?

nest_data <- nest_data %>%
  pivot_longer(cols = 5:32, names_to = "day", values_to = "fate", 
               names_prefix = "Day") %>% 
  mutate(veg_max = round(runif(n(), 0, 250), 0))

head(nest_data)
## # A tibble: 6 x 6
##   Nest_ID veg_max    gc age   day    fate
##     <int>   <dbl> <dbl> <chr> <chr> <int>
## 1       1      16  0.23 adult 1         1
## 2       1     185  0.23 adult 2         1
## 3       1     249  0.23 adult 3         1
## 4       1      55  0.23 adult 4         1
## 5       1     106  0.23 adult 5         1
## 6       1     201  0.23 adult 6         1

# We will make our data set wider, then longer again
nest_data <- nest_data %>% 
  pivot_wider(names_from = day, names_prefix = "Day", 
              values_from = c(fate, veg_max))

# This is truly hideous... but it is good practice
nest_data
## # A tibble: 25 x 59
##    Nest_ID    gc age   fate_Day1 fate_Day2 fate_Day3 fate_Day4 fate_Day5
##      <int> <dbl> <chr>     <int>     <int>     <int>     <int>     <int>
##  1       1  0.23 adult         1         1         1         1         1
##  2       2  0.86 adult         1         1         1         1         1
##  3       3  0.1  adult         1         1         1         1         1
##  4       4  0.4  adult         1         1         1         1         0
##  5       5  0.2  adult         1         1         1         1         1
##  6       6  0.52 adult         1         1         1         1         1
##  7       7  0.84 juve~         1         1         1         1         1
##  8       8  0.26 juve~         1         1         1         1         1
##  9       9  0.56 adult         1         1         1         1         1
## 10      10  0.37 adult         1         1         1         1         1
## # ... with 15 more rows, and 51 more variables: fate_Day6 <int>,
## #   fate_Day7 <int>, fate_Day8 <int>, fate_Day9 <int>, fate_Day10 <int>,
## #   fate_Day11 <int>, fate_Day12 <int>, fate_Day13 <int>, fate_Day14 <int>,
## #   fate_Day15 <int>, fate_Day16 <int>, fate_Day17 <int>, fate_Day18 <int>,
## #   fate_Day19 <int>, fate_Day20 <int>, fate_Day21 <int>, fate_Day22 <int>,
## #   fate_Day23 <int>, fate_Day24 <int>, fate_Day25 <int>, fate_Day26 <int>,
## #   fate_Day27 <int>, fate_Day28 <int>, veg_max_Day1 <dbl>, veg_max_Day2 <dbl>,
## #   veg_max_Day3 <dbl>, veg_max_Day4 <dbl>, veg_max_Day5 <dbl>,
## #   veg_max_Day6 <dbl>, veg_max_Day7 <dbl>, veg_max_Day8 <dbl>,
## #   veg_max_Day9 <dbl>, veg_max_Day10 <dbl>, veg_max_Day11 <dbl>,
## #   veg_max_Day12 <dbl>, veg_max_Day13 <dbl>, veg_max_Day14 <dbl>,
## #   veg_max_Day15 <dbl>, veg_max_Day16 <dbl>, veg_max_Day17 <dbl>,
## #   veg_max_Day18 <dbl>, veg_max_Day19 <dbl>, veg_max_Day20 <dbl>,
## #   veg_max_Day21 <dbl>, veg_max_Day22 <dbl>, veg_max_Day23 <dbl>,
## #   veg_max_Day24 <dbl>, veg_max_Day25 <dbl>, veg_max_Day26 <dbl>,
## #   veg_max_Day27 <dbl>, veg_max_Day28 <dbl>

# Now we will tidy up this mess with some good helper functions from tidyverse
nest_data %>% 
  pivot_longer(cols = contains("Day"), 
               names_to = c(".value", "day"),  
               names_pattern = "(._?.*)_(.*)")
## # A tibble: 700 x 6
##    Nest_ID    gc age   day    fate veg_max
##      <int> <dbl> <chr> <chr> <int>   <dbl>
##  1       1  0.23 adult Day1      1      16
##  2       1  0.23 adult Day2      1     185
##  3       1  0.23 adult Day3      1     249
##  4       1  0.23 adult Day4      1      55
##  5       1  0.23 adult Day5      1     106
##  6       1  0.23 adult Day6      1     201
##  7       1  0.23 adult Day7      1     108
##  8       1  0.23 adult Day8      1     134
##  9       1  0.23 adult Day9      1      13
## 10       1  0.23 adult Day10     1      78
## # ... with 690 more rows

3 Extracting variables and observations


It is often necessary to extract an entire variable or set of variables or observations. This is easily achieved with filter() for observations or select() with variables.

filter() allows you to subset observations based on their value. We will work on the mpg data set included with the tidyverse packages.

head(mpg)
## # A tibble: 6 x 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 chevrolet    k1500 tah~   5.7  1999     8 auto(~ 4        11    15 r     suv  
## 2 dodge        dakota pi~   5.2  1999     8 auto(~ 4        11    15 r     pick~
## 3 dodge        durango 4~   5.9  1999     8 auto(~ 4        11    15 r     suv  
## 4 dodge        ram 1500 ~   5.2  1999     8 auto(~ 4        11    15 r     pick~
## 5 dodge        ram 1500 ~   5.9  1999     8 auto(~ 4        11    15 r     pick~
## 6 ford         f150 pick~   5.4  1999     8 auto(~ 4        11    15 r     pick~

# See what manufacturers we have
unique(mpg$manufacturer)
##  [1] "chevrolet"  "dodge"      "ford"       "land rover" "toyota"    
##  [6] "lincoln"    "jeep"       "mercury"    "nissan"     "volkswagen"
## [11] "audi"       "subaru"     "hyundai"    "pontiac"    "honda"

# Subset the data to include only Dodge
mpg %>%
  filter(manufacturer == "dodge")
## # A tibble: 37 x 11
##    manufacturer model     displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>        <chr>     <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 dodge        dakota p~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  2 dodge        durango ~   5.9  1999     8 auto(~ 4        11    15 r     suv  
##  3 dodge        ram 1500~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  4 dodge        ram 1500~   5.9  1999     8 auto(~ 4        11    15 r     pick~
##  5 dodge        durango ~   5.2  1999     8 auto(~ 4        11    16 r     suv  
##  6 dodge        ram 1500~   5.2  1999     8 manua~ 4        11    16 r     pick~
##  7 dodge        dakota p~   3.9  1999     6 auto(~ 4        13    17 r     pick~
##  8 dodge        dakota p~   3.9  1999     6 manua~ 4        14    17 r     pick~
##  9 dodge        dakota p~   5.2  1999     8 manua~ 4        11    17 r     pick~
## 10 dodge        durango ~   3.9  1999     6 auto(~ 4        13    17 r     suv  
## # ... with 27 more rows


# Filter out all but audi
mpg %>%
  filter(manufacturer != "audi")
## # A tibble: 216 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 chevrolet    k1500 tah~   5.7  1999     8 auto~ 4        11    15 r     suv  
##  2 dodge        dakota pi~   5.2  1999     8 auto~ 4        11    15 r     pick~
##  3 dodge        durango 4~   5.9  1999     8 auto~ 4        11    15 r     suv  
##  4 dodge        ram 1500 ~   5.2  1999     8 auto~ 4        11    15 r     pick~
##  5 dodge        ram 1500 ~   5.9  1999     8 auto~ 4        11    15 r     pick~
##  6 ford         f150 pick~   5.4  1999     8 auto~ 4        11    15 r     pick~
##  7 land rover   range rov~   4    1999     8 auto~ 4        11    15 p     suv  
##  8 land rover   range rov~   4.6  1999     8 auto~ 4        11    15 p     suv  
##  9 toyota       land crui~   4.7  1999     8 auto~ 4        11    15 r     suv  
## 10 dodge        durango 4~   5.2  1999     8 auto~ 4        11    16 r     suv  
## # ... with 206 more rows

# We want to know which cars get at least 20 mpg in the city
mpg %>%
  filter(cty >= 20)
## # A tibble: 56 x 11
##    manufacturer model    displ  year   cyl trans  drv     cty   hwy fl    class 
##    <chr>        <chr>    <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr> 
##  1 subaru       impreza~   2.2  1999     4 auto(~ 4        21    26 r     subco~
##  2 toyota       camry      2.2  1999     4 auto(~ f        21    27 r     midsi~
##  3 toyota       camry s~   2.2  1999     4 auto(~ f        21    27 r     compa~
##  4 audi         a4         1.8  1999     4 manua~ f        21    29 p     compa~
##  5 honda        civic      1.6  1999     4 manua~ f        23    29 p     subco~
##  6 nissan       altima     2.4  1999     4 manua~ f        21    29 r     compa~
##  7 toyota       camry      2.2  1999     4 manua~ f        21    29 r     midsi~
##  8 toyota       camry s~   2.2  1999     4 manua~ f        21    29 r     compa~
##  9 volkswagen   gti        2    1999     4 manua~ f        21    29 r     compa~
## 10 volkswagen   jetta      2    1999     4 manua~ f        21    29 r     compa~
## # ... with 46 more rows

filter() combines multiple arguments with "and", i.e., every expression must be true. If you want other combinations you can use Boolean operators (e.g., or is |).

# If we want Dodge or Ford manufacturers we use |
mpg %>% 
  filter(manufacturer == "dodge" | manufacturer == "ford")
## # A tibble: 62 x 11
##    manufacturer model     displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>        <chr>     <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 dodge        dakota p~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  2 dodge        durango ~   5.9  1999     8 auto(~ 4        11    15 r     suv  
##  3 dodge        ram 1500~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  4 dodge        ram 1500~   5.9  1999     8 auto(~ 4        11    15 r     pick~
##  5 ford         f150 pic~   5.4  1999     8 auto(~ 4        11    15 r     pick~
##  6 dodge        durango ~   5.2  1999     8 auto(~ 4        11    16 r     suv  
##  7 dodge        ram 1500~   5.2  1999     8 manua~ 4        11    16 r     pick~
##  8 ford         f150 pic~   4.6  1999     8 manua~ 4        13    16 r     pick~
##  9 ford         f150 pic~   4.6  1999     8 auto(~ 4        13    16 r     pick~
## 10 dodge        dakota p~   3.9  1999     6 auto(~ 4        13    17 r     pick~
## # ... with 52 more rows

# This won't work, because a manufacuter cannot be both Dodge and Ford
mpg %>% 
  filter(manufacturer == "dodge", manufacturer == "ford")
## # A tibble: 0 x 11
## # ... with 11 variables: manufacturer <chr>, model <chr>, displ <dbl>,
## #   year <int>, cyl <int>, trans <chr>, drv <chr>, cty <int>, hwy <int>,
## #   fl <chr>, class <chr>

# Here is a shorthand. This will select every row where manufacturer is one of the
# values in the c() function
mpg %>%
  filter(manufacturer %in% c("dodge", "ford"))
## # A tibble: 62 x 11
##    manufacturer model     displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>        <chr>     <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 dodge        dakota p~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  2 dodge        durango ~   5.9  1999     8 auto(~ 4        11    15 r     suv  
##  3 dodge        ram 1500~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  4 dodge        ram 1500~   5.9  1999     8 auto(~ 4        11    15 r     pick~
##  5 ford         f150 pic~   5.4  1999     8 auto(~ 4        11    15 r     pick~
##  6 dodge        durango ~   5.2  1999     8 auto(~ 4        11    16 r     suv  
##  7 dodge        ram 1500~   5.2  1999     8 manua~ 4        11    16 r     pick~
##  8 ford         f150 pic~   4.6  1999     8 manua~ 4        13    16 r     pick~
##  9 ford         f150 pic~   4.6  1999     8 auto(~ 4        13    16 r     pick~
## 10 dodge        dakota p~   3.9  1999     6 auto(~ 4        13    17 r     pick~
## # ... with 52 more rows

# You can also randomly select a percentage or set number of rows
mpg %>%
  sample_n(25, replace = FALSE)
## # A tibble: 25 x 11
##    manufacturer model     displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>        <chr>     <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 audi         a4          3.1  2008     6 auto(~ f        18    27 p     comp~
##  2 volkswagen   gti         2    1999     4 manua~ f        21    29 r     comp~
##  3 audi         a4          1.8  1999     4 manua~ f        21    29 p     comp~
##  4 land rover   range ro~   4    1999     8 auto(~ 4        11    15 p     suv  
##  5 chevrolet    k1500 ta~   5.7  1999     8 auto(~ 4        11    15 r     suv  
##  6 dodge        dakota p~   4.7  2008     8 auto(~ 4        14    19 r     pick~
##  7 nissan       pathfind~   3.3  1999     6 manua~ 4        15    17 r     suv  
##  8 chevrolet    c1500 su~   5.3  2008     8 auto(~ r        14    20 r     suv  
##  9 volkswagen   jetta       2.8  1999     6 auto(~ f        16    23 r     comp~
## 10 ford         f150 pic~   4.6  1999     8 manua~ 4        13    16 r     pick~
## # ... with 15 more rows

mpg %>%
  sample_frac(0.10, replace = FALSE)
## # A tibble: 23 x 11
##    manufacturer model    displ  year   cyl trans  drv     cty   hwy fl    class 
##    <chr>        <chr>    <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr> 
##  1 toyota       toyota ~   2.7  1999     4 manua~ 4        15    20 r     pickup
##  2 dodge        caravan~   3.3  2008     6 auto(~ f        11    17 e     miniv~
##  3 volkswagen   new bee~   2.5  2008     5 manua~ f        20    28 r     subco~
##  4 dodge        caravan~   3.8  1999     6 auto(~ f        15    21 r     miniv~
##  5 chevrolet    c1500 s~   5.7  1999     8 auto(~ r        13    17 r     suv   
##  6 ford         f150 pi~   4.6  2008     8 auto(~ 4        13    17 r     pickup
##  7 nissan       altima     2.4  1999     4 manua~ f        21    29 r     compa~
##  8 dodge        durango~   5.9  1999     8 auto(~ 4        11    15 r     suv   
##  9 nissan       altima     2.5  2008     4 manua~ f        23    32 r     midsi~
## 10 audi         a4 quat~   2.8  1999     6 auto(~ 4        15    25 p     compa~
## # ... with 13 more rows

Sometimes we might want to subset observations containing a particular sequence of letters. In our mpg data set, there are many different models that contain "4wd", more than we would want to write out. Instead, we can detect the sting using str_detect().

mpg %>% 
  filter(str_detect(model,'4wd')) 
## # A tibble: 74 x 11
##    manufacturer model     displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>        <chr>     <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 chevrolet    k1500 ta~   5.7  1999     8 auto(~ 4        11    15 r     suv  
##  2 dodge        dakota p~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  3 dodge        durango ~   5.9  1999     8 auto(~ 4        11    15 r     suv  
##  4 dodge        ram 1500~   5.2  1999     8 auto(~ 4        11    15 r     pick~
##  5 dodge        ram 1500~   5.9  1999     8 auto(~ 4        11    15 r     pick~
##  6 ford         f150 pic~   5.4  1999     8 auto(~ 4        11    15 r     pick~
##  7 toyota       land cru~   4.7  1999     8 auto(~ 4        11    15 r     suv  
##  8 dodge        durango ~   5.2  1999     8 auto(~ 4        11    16 r     suv  
##  9 dodge        ram 1500~   5.2  1999     8 manua~ 4        11    16 r     pick~
## 10 ford         f150 pic~   4.6  1999     8 manua~ 4        13    16 r     pick~
## # ... with 64 more rows

It is not uncommon to get a large data set with variables that aren't needed for a specific analysis. select() allows you to extract certain variables for easy data sub setting.

# We can select certain variables by name
mpg %>%
  select(model, displ, hwy)
## # A tibble: 234 x 3
##    model                  displ   hwy
##    <chr>                  <dbl> <int>
##  1 k1500 tahoe 4wd          5.7    15
##  2 dakota pickup 4wd        5.2    15
##  3 durango 4wd              5.9    15
##  4 ram 1500 pickup 4wd      5.2    15
##  5 ram 1500 pickup 4wd      5.9    15
##  6 f150 pickup 4wd          5.4    15
##  7 range rover              4      15
##  8 range rover              4.6    15
##  9 land cruiser wagon 4wd   4.7    15
## 10 durango 4wd              5.2    16
## # ... with 224 more rows

# We can select all but named variables
mpg %>%
  select(-model, -year, -hwy)
## # A tibble: 234 x 8
##    manufacturer displ   cyl trans    drv     cty fl    class 
##    <chr>        <dbl> <int> <chr>    <chr> <int> <chr> <chr> 
##  1 chevrolet      5.7     8 auto(l4) 4        11 r     suv   
##  2 dodge          5.2     8 auto(l4) 4        11 r     pickup
##  3 dodge          5.9     8 auto(l4) 4        11 r     suv   
##  4 dodge          5.2     8 auto(l4) 4        11 r     pickup
##  5 dodge          5.9     8 auto(l4) 4        11 r     pickup
##  6 ford           5.4     8 auto(l4) 4        11 r     pickup
##  7 land rover     4       8 auto(l4) 4        11 p     suv   
##  8 land rover     4.6     8 auto(l4) 4        11 p     suv   
##  9 toyota         4.7     8 auto(l4) 4        11 r     suv   
## 10 dodge          5.2     8 auto(l4) 4        11 r     suv   
## # ... with 224 more rows

# We can select by position
mpg %>%
  select(1, 5, 6)
## # A tibble: 234 x 3
##    manufacturer   cyl trans   
##    <chr>        <int> <chr>   
##  1 chevrolet        8 auto(l4)
##  2 dodge            8 auto(l4)
##  3 dodge            8 auto(l4)
##  4 dodge            8 auto(l4)
##  5 dodge            8 auto(l4)
##  6 ford             8 auto(l4)
##  7 land rover       8 auto(l4)
##  8 land rover       8 auto(l4)
##  9 toyota           8 auto(l4)
## 10 dodge            8 auto(l4)
## # ... with 224 more rows

# We can select all columns from 2:6, we could also do that by name
mpg %>%
  select(2:6)
## # A tibble: 234 x 5
##    model                  displ  year   cyl trans   
##    <chr>                  <dbl> <int> <int> <chr>   
##  1 k1500 tahoe 4wd          5.7  1999     8 auto(l4)
##  2 dakota pickup 4wd        5.2  1999     8 auto(l4)
##  3 durango 4wd              5.9  1999     8 auto(l4)
##  4 ram 1500 pickup 4wd      5.2  1999     8 auto(l4)
##  5 ram 1500 pickup 4wd      5.9  1999     8 auto(l4)
##  6 f150 pickup 4wd          5.4  1999     8 auto(l4)
##  7 range rover              4    1999     8 auto(l4)
##  8 range rover              4.6  1999     8 auto(l4)
##  9 land cruiser wagon 4wd   4.7  1999     8 auto(l4)
## 10 durango 4wd              5.2  1999     8 auto(l4)
## # ... with 224 more rows

You can also use some of the helper functions to select certain variables:

# Select variables that contain "y"
mpg %>% 
  select(contains('y'))
## # A tibble: 234 x 4
##     year   cyl   cty   hwy
##    <int> <int> <int> <int>
##  1  1999     8    11    15
##  2  1999     8    11    15
##  3  1999     8    11    15
##  4  1999     8    11    15
##  5  1999     8    11    15
##  6  1999     8    11    15
##  7  1999     8    11    15
##  8  1999     8    11    15
##  9  1999     8    11    15
## 10  1999     8    11    16
## # ... with 224 more rows

# Select variables that start with "c"
mpg %>% select(starts_with('c'))
## # A tibble: 234 x 3
##      cyl   cty class 
##    <int> <int> <chr> 
##  1     8    11 suv   
##  2     8    11 pickup
##  3     8    11 suv   
##  4     8    11 pickup
##  5     8    11 pickup
##  6     8    11 pickup
##  7     8    11 suv   
##  8     8    11 suv   
##  9     8    11 suv   
## 10     8    11 suv   
## # ... with 224 more rows


# Select can also be used to rearrange you data
mpg %>% select(class, displ, year, everything())
## # A tibble: 234 x 11
##    class  displ  year manufacturer model       cyl trans drv     cty   hwy fl   
##    <chr>  <dbl> <int> <chr>        <chr>     <int> <chr> <chr> <int> <int> <chr>
##  1 suv      5.7  1999 chevrolet    k1500 ta~     8 auto~ 4        11    15 r    
##  2 pickup   5.2  1999 dodge        dakota p~     8 auto~ 4        11    15 r    
##  3 suv      5.9  1999 dodge        durango ~     8 auto~ 4        11    15 r    
##  4 pickup   5.2  1999 dodge        ram 1500~     8 auto~ 4        11    15 r    
##  5 pickup   5.9  1999 dodge        ram 1500~     8 auto~ 4        11    15 r    
##  6 pickup   5.4  1999 ford         f150 pic~     8 auto~ 4        11    15 r    
##  7 suv      4    1999 land rover   range ro~     8 auto~ 4        11    15 p    
##  8 suv      4.6  1999 land rover   range ro~     8 auto~ 4        11    15 p    
##  9 suv      4.7  1999 toyota       land cru~     8 auto~ 4        11    15 r    
## 10 suv      5.2  1999 dodge        durango ~     8 auto~ 4        11    16 r    
## # ... with 224 more rows

4 Adding variables and data summary


It is often necessary to create new variables, either from scratch or conditional on other variables. In the tidyverse package, mutate() will add new columns at the end of your data set. We will start by subsetting an example data set.

(sml_flights <- select(flights, year:day, ends_with("delay"), distance, air_time))
## # A tibble: 336,776 x 7
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1         2        11     1400      227
##  2  2013     1     1         4        20     1416      227
##  3  2013     1     1         2        33     1089      160
##  4  2013     1     1        -1       -18     1576      183
##  5  2013     1     1        -6       -25      762      116
##  6  2013     1     1        -4        12      719      150
##  7  2013     1     1        -5        19     1065      158
##  8  2013     1     1        -3       -14      229       53
##  9  2013     1     1        -3        -8      944      140
## 10  2013     1     1        -2         8      733      138
## # ... with 336,766 more rows

# Now create new columns using mutate()
sml_flights %>%
  mutate(gain = dep_delay - arr_delay, 
         hours = air_time / 60, 
         gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time  gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227    -9 3.78 
##  2  2013     1     1         4        20     1416      227   -16 3.78 
##  3  2013     1     1         2        33     1089      160   -31 2.67 
##  4  2013     1     1        -1       -18     1576      183    17 3.05 
##  5  2013     1     1        -6       -25      762      116    19 1.93 
##  6  2013     1     1        -4        12      719      150   -16 2.5  
##  7  2013     1     1        -5        19     1065      158   -24 2.63 
##  8  2013     1     1        -3       -14      229       53    11 0.883
##  9  2013     1     1        -3        -8      944      140     5 2.33 
## 10  2013     1     1        -2         8      733      138   -10 2.3  
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

# If you only want to keep the new variable, use transmute()
sml_flights %>%
  transmute(gain = dep_delay - arr_delay, 
         hours = air_time / 60, 
         gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1    -9 3.78          -2.38
##  2   -16 3.78          -4.23
##  3   -31 2.67         -11.6 
##  4    17 3.05           5.57
##  5    19 1.93           9.83
##  6   -16 2.5           -6.4 
##  7   -24 2.63          -9.11
##  8    11 0.883         12.5 
##  9     5 2.33           2.14
## 10   -10 2.3           -4.35
## # ... with 336,766 more rows

You can use all of the familiar arithmetic operators (e.g., +, -, ^) with mutate. Modular arithmetic is a handy tool because it allows you to break integers up into pieces. %/% is integer division and %% is the remainder. For example, in the flights data set you can compute the hour and minute from the depart time:

flights %>%
  transmute(dep_time, 
            hour = dep_time %/% 100, 
            minute = dep_time %% 100)
## # A tibble: 336,776 x 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517     5     17
##  2      533     5     33
##  3      542     5     42
##  4      544     5     44
##  5      554     5     54
##  6      554     5     54
##  7      555     5     55
##  8      557     5     57
##  9      557     5     57
## 10      558     5     58
## # ... with 336,766 more rows

You can also determine offsets or running differences using lead() and lag().

x <- 1:10

lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA

# Running differences
x - lag(x)
##  [1] NA  1  1  1  1  1  1  1  1  1

# Find when values change
(x != lag(x))
##  [1]   NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# There are also cumulative and rolling aggregates
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
cummean(x)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

When you are creating a new variable based on certain conditions being met in other variables you can use case_when instead of stacking ifelse statements.

flights %>%
  transmute(dep_time, 
            hour = dep_time %/% 100, 
            minute = dep_time %% 100, 
            time_category = case_when(
              hour <= 8 ~ "early", 
              hour > 8 & hour <= 18 ~ "good", 
              hour > 18 ~ "late"
            )) %>%
  sample_n(20, replace = FALSE)
## # A tibble: 20 x 4
##    dep_time  hour minute time_category
##       <int> <dbl>  <dbl> <chr>        
##  1     1031    10     31 good         
##  2     1806    18      6 good         
##  3     1330    13     30 good         
##  4       NA    NA     NA <NA>         
##  5     1117    11     17 good         
##  6     2058    20     58 late         
##  7     1654    16     54 good         
##  8     1503    15      3 good         
##  9      903     9      3 good         
## 10     1843    18     43 good         
## 11     1624    16     24 good         
## 12     1226    12     26 good         
## 13     1239    12     39 good         
## 14     2040    20     40 late         
## 15     1740    17     40 good         
## 16     1603    16      3 good         
## 17      605     6      5 early        
## 18     1139    11     39 good         
## 19     1027    10     27 good         
## 20     1509    15      9 good

Calculating data summary statistics is also necessary. This is easily achieved with summarise() and also helper functions like group_by() or n(). summarise() will collapse data to a single row:

summarise(flights, 
          mean_delay = mean(dep_delay, na.rm = TRUE), 
          sd_delay = sd(dep_delay, na.rm = TRUE),
          min_delay = min(dep_delay, na.rm = TRUE),
          max_delay = max(dep_delay, na.rm = TRUE),
          distinct_delay = n_distinct(dep_delay),
          nrow = n())
## # A tibble: 1 x 6
##   mean_delay sd_delay min_delay max_delay distinct_delay   nrow
##        <dbl>    <dbl>     <dbl>     <dbl>          <int>  <int>
## 1       12.6     40.2       -43      1301            528 336776

The utility of summarise() isn't fully realized until you pair it with group_by(). This changes the unit of analysis from the complete data set to the group. Then, you can apply functions to the group.

# Get average delay stats per day. Each unique group gets it's own row
flights %>% 
  group_by(year, month, day) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE), 
          sd_delay = sd(dep_delay, na.rm = TRUE),
          min_delay = min(dep_delay, na.rm = TRUE),
          max_delay = max(dep_delay, na.rm = TRUE),
          distinct_delay = n_distinct(dep_delay),
          nrow = n())
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 9
## # Groups:   year, month [12]
##     year month   day mean_delay sd_delay min_delay max_delay distinct_delay
##    <int> <int> <int>      <dbl>    <dbl>     <dbl>     <dbl>          <int>
##  1  2013     1     1      11.5      45.3       -15       853            108
##  2  2013     1     2      13.9      37.2       -13       379            119
##  3  2013     1     3      11.0      31.5       -13       291            113
##  4  2013     1     4       8.95     27.7       -19       288            108
##  5  2013     1     5       5.73     25.7       -16       327             83
##  6  2013     1     6       7.15     23.2       -15       202             93
##  7  2013     1     7       5.42     27.1       -17       366             97
##  8  2013     1     8       2.55     19.3       -16       188             86
##  9  2013     1     9       2.28     46.9       -17      1301             75
## 10  2013     1    10       2.84     43.9       -17      1126             84
## # ... with 355 more rows, and 1 more variable: nrow <int>

# Group by day only
flights %>% 
  group_by(day) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE), 
          sd_delay = sd(dep_delay, na.rm = TRUE),
          min_delay = min(dep_delay, na.rm = TRUE),
          max_delay = max(dep_delay, na.rm = TRUE),
          distinct_delay = n_distinct(dep_delay),
          nrow = n())
## # A tibble: 31 x 7
##      day mean_delay sd_delay min_delay max_delay distinct_delay  nrow
##  * <int>      <dbl>    <dbl>     <dbl>     <dbl>          <int> <int>
##  1     1      14.2      41.8       -23       853            286 11036
##  2     2      14.1      43.9       -24       696            301 10808
##  3     3      10.8      36.8       -33       878            266 11211
##  4     4       5.79     27.2       -20       545            222 11059
##  5     5       7.82     37.5       -24       896            281 10858
##  6     6       6.99     29.8       -22       589            225 11059
##  7     7      14.3      43.4       -43       653            298 10985
##  8     8      21.8      53.0       -21       520            333 11271
##  9     9      14.6      40.7       -26      1301            266 10857
## 10    10      18.3      51.8       -32      1126            322 11227
## # ... with 21 more rows

# Find the number of flights that were early each day
flights %>%
  mutate(dep_time, 
            hour = dep_time %/% 100, 
            minute = dep_time %% 100, 
            time_category = case_when(
              hour <= 8 ~ "early", 
              hour > 8 & hour <= 18 ~ "good", 
              hour > 18 ~ "late"
            )) %>%
  group_by(year, month, day) %>%
  summarise(n_early = sum(time_category == "early", na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1     169
##  2  2013     1     2     224
##  3  2013     1     3     214
##  4  2013     1     4     223
##  5  2013     1     5     183
##  6  2013     1     6     157
##  7  2013     1     7     222
##  8  2013     1     8     228
##  9  2013     1     9     226
## 10  2013     1    10     234
## # ... with 355 more rows

#OR
flights %>%
  mutate(dep_time, 
            hour = dep_time %/% 100, 
            minute = dep_time %% 100, 
            time_category = case_when(
              hour <= 8 ~ "early", 
              hour > 8 & hour <= 18 ~ "good", 
              hour > 18 ~ "late"
            )) %>%
  group_by(year, month, day, time_category) %>%
  summarise(n_cat = n())
## `summarise()` has grouped output by 'year', 'month', 'day'. You can override using the `.groups` argument.
## # A tibble: 1,451 x 5
## # Groups:   year, month, day [365]
##     year month   day time_category n_cat
##    <int> <int> <int> <chr>         <int>
##  1  2013     1     1 early           169
##  2  2013     1     1 good            526
##  3  2013     1     1 late            143
##  4  2013     1     1 <NA>              4
##  5  2013     1     2 early           224
##  6  2013     1     2 good            555
##  7  2013     1     2 late            156
##  8  2013     1     2 <NA>              8
##  9  2013     1     3 early           214
## 10  2013     1     3 good            537
## # ... with 1,441 more rows

When working with grouped data, you can use ungroup() to strip away all groups and apply functions to the data set as a whole and not by group. group_by() can be combined with all other tidyverse functions such as mutate() and filter().

# Worst delays for each group
sml_flights %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 7
## # Groups:   year, month, day [365]
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1       853       851      184       41
##  2  2013     1     1       290       338     1134      213
##  3  2013     1     1       260       263      266       46
##  4  2013     1     1       157       174      213       60
##  5  2013     1     1       216       222      708      121
##  6  2013     1     1       255       250      589      115
##  7  2013     1     1       285       246     1085      146
##  8  2013     1     1       192       191      199       44
##  9  2013     1     1       379       456     1092      222
## 10  2013     1     2       224       207      550       94
## # ... with 3,296 more rows

# Groups bigger than threashold
flights %>%
  group_by(dest) %>%
  filter(n() > 365) %>%
  select(dest, everything())
## # A tibble: 332,577 x 19
## # Groups:   dest [77]
##    dest   year month   day dep_time sched_dep_time dep_delay arr_time
##    <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1 IAH    2013     1     1      517            515         2      830
##  2 IAH    2013     1     1      533            529         4      850
##  3 MIA    2013     1     1      542            540         2      923
##  4 BQN    2013     1     1      544            545        -1     1004
##  5 ATL    2013     1     1      554            600        -6      812
##  6 ORD    2013     1     1      554            558        -4      740
##  7 FLL    2013     1     1      555            600        -5      913
##  8 IAD    2013     1     1      557            600        -3      709
##  9 MCO    2013     1     1      557            600        -3      838
## 10 ORD    2013     1     1      558            600        -2      753
## # ... with 332,567 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# Then standardize to compare group stats
flights %>%
  group_by(dest) %>%
  filter(n() > 365) %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##     year month   day dest  arr_delay prop_delay
##    <int> <int> <int> <chr>     <dbl>      <dbl>
##  1  2013     1     1 IAH          11  0.000111 
##  2  2013     1     1 IAH          20  0.000201 
##  3  2013     1     1 MIA          33  0.000235 
##  4  2013     1     1 ORD          12  0.0000424
##  5  2013     1     1 FLL          19  0.0000938
##  6  2013     1     1 ORD           8  0.0000283
##  7  2013     1     1 LAX           7  0.0000344
##  8  2013     1     1 DFW          31  0.000282 
##  9  2013     1     1 ATL          12  0.0000400
## 10  2013     1     1 DTW          16  0.000116 
## # ... with 131,096 more rows

Other ways to create new variables is by separating or uniting existing variables. separate() pulls apart one column into multiple columns by splitting based on a separator character or regex.

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

# Now split the rate into cases and population size
table3 %>%
  separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

# Notice that cases and population are character columns. We can include convert=TRUE
table3 %>%
  separate(col = rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

# Separate by position
table3 %>%
  separate(year, into = c("century", "decade"), sep = 2, convert = TRUE)
## # A tibble: 6 x 4
##   country     century decade rate             
##   <chr>         <int>  <int> <chr>            
## 1 Afghanistan      19     99 745/19987071     
## 2 Afghanistan      20      0 2666/20595360    
## 3 Brazil           19     99 37737/172006362  
## 4 Brazil           20      0 80488/174504898  
## 5 China            19     99 212258/1272915272
## 6 China            20      0 213766/1280428583

By default it will split at any non-alphanumeric character. But you can also use any regular expression by including the sep = argument.

Now for a more complicated example. Lets create a super messy data set, and clean it up.

messy_data <- data.frame(
  state = sample(c("TN", "KY", "GA", "AL"), size = 40, replace = TRUE, 
                  prob = c(0.3, 0.2, 0.25, 0.25)), 
  ID = 1:40, 
  w1 = round(runif(40, 10, 900), 0), 
  w2 = round(runif(40, 10, 900), 0),
  w3 = sample(c(round(runif(1, 10, 900), 0), NA), size = 40, replace = TRUE, 
              prob = c(0.8, 0.2)),
  w4 = sample(c(round(runif(1, 10, 900), 0), NA), size = 40, replace = TRUE, 
              prob = c(0.4, 0.6))) %>%
  unite(wgt, starts_with("w"), sep = ",", na.rm = TRUE) %>%
  unite(dat, state, ID, wgt, sep = "_")
messy_data %>%
  head()
##                    dat
## 1 GA_1_229,171,387,549
## 2  TN_2_60,878,387,549
## 3          TN_3_599,99
## 4     GA_4_605,125,387
## 5      GA_5_76,663,387
## 6     GA_6_318,216,387

# Now separte our state, ID, and weights
messy_data %>%
  separate(dat, into = c("State", "ID", "Wgts"), sep = "_") %>%
  head()
##   State ID            Wgts
## 1    GA  1 229,171,387,549
## 2    TN  2  60,878,387,549
## 3    TN  3          599,99
## 4    GA  4     605,125,387
## 5    GA  5      76,663,387
## 6    GA  6     318,216,387

# Now we are going str_split() to split our strings, it will create a list of our splits
messy_data %>%
  separate(dat, into = c("State", "ID", "Wgts"), sep = "_") %>%
  mutate(wgt_list = str_split(Wgts, pattern = ",")) %>%
  select(-Wgts) %>%
  head()
##   State ID           wgt_list
## 1    GA  1 229, 171, 387, 549
## 2    TN  2  60, 878, 387, 549
## 3    TN  3            599, 99
## 4    GA  4      605, 125, 387
## 5    GA  5       76, 663, 387
## 6    GA  6      318, 216, 387

# We will unnest() the list column to create a long-form table
messy_data %>%
  separate(dat, into = c("State", "ID", "Wgts"), sep = "_") %>%
  mutate(wgt_list = str_split(Wgts, pattern = ",")) %>%
  select(-Wgts) %>%
  unnest(col = wgt_list)
## # A tibble: 127 x 3
##    State ID    wgt_list
##    <chr> <chr> <chr>   
##  1 GA    1     229     
##  2 GA    1     171     
##  3 GA    1     387     
##  4 GA    1     549     
##  5 TN    2     60      
##  6 TN    2     878     
##  7 TN    2     387     
##  8 TN    2     549     
##  9 TN    3     599     
## 10 TN    3     99      
## # ... with 117 more rows

The opposite of separate() is unite(), which combines multiple columns into a single column. It is most useful if you want to combine multiple fields (like capture year, capture number, and sex) to create unique and informative ID numbers.

table5
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

# The default unites with the _ as a separator
table5 %>% 
  unite(new, century, year)
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

# But you can change that to whatever you want
table5 %>% 
  unite(new, century, year, sep = "")
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

5 Relational data


Most analyses will involve multiple tables or excel sheets of data. Maybe you have one file with a list of ID numbers and individual characteristics like sex, age, capture date, etc. You might have a separate file with all the recaptures for every individual that doesn't include all the individual characteristics. These multiple excel tables are relational data and can be joined to combine the information contained in them.

We will dive into joins using the nycflights13 data that contains 4 tibbles

head(airlines)
## # A tibble: 6 x 2
##   carrier name                    
##   <chr>   <chr>                   
## 1 9E      Endeavor Air Inc.       
## 2 AA      American Airlines Inc.  
## 3 AS      Alaska Airlines Inc.    
## 4 B6      JetBlue Airways         
## 5 DL      Delta Air Lines Inc.    
## 6 EV      ExpressJet Airlines Inc.
head(airports)
## # A tibble: 6 x 8
##   faa   name                          lat   lon   alt    tz dst   tzone         
##   <chr> <chr>                       <dbl> <dbl> <dbl> <dbl> <chr> <chr>         
## 1 04G   Lansdowne Airport            41.1 -80.6  1044    -5 A     America/New_Y~
## 2 06A   Moton Field Municipal Airp~  32.5 -85.7   264    -6 A     America/Chica~
## 3 06C   Schaumburg Regional          42.0 -88.1   801    -6 A     America/Chica~
## 4 06N   Randall Airport              41.4 -74.4   523    -5 A     America/New_Y~
## 5 09J   Jekyll Island Airport        31.1 -81.4    11    -5 A     America/New_Y~
## 6 0A9   Elizabethton Municipal Air~  36.4 -82.2  1593    -5 A     America/New_Y~
head(planes)
## # A tibble: 6 x 9
##   tailnum  year type           manufacturer   model  engines seats speed engine 
##   <chr>   <int> <chr>          <chr>          <chr>    <int> <int> <int> <chr>  
## 1 N10156   2004 Fixed wing mu~ EMBRAER        EMB-1~       2    55    NA Turbo-~
## 2 N102UW   1998 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 3 N103US   1999 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 4 N104UW   1999 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 5 N10575   2002 Fixed wing mu~ EMBRAER        EMB-1~       2    55    NA Turbo-~
## 6 N105UW   1999 Fixed wing mu~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
head(weather)
## # A tibble: 6 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
## 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
## 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
## 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
## 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
## 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
## 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

Mutating joins combine variables from two tables. It first matches observations by their keys (unique identifiers) then copies across variables from one table to another.

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
head(flights2)
## # A tibble: 6 x 8
##    year month   day  hour origin dest  tailnum carrier
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
## 1  2013     1     1     5 EWR    IAH   N14228  UA     
## 2  2013     1     1     5 LGA    IAH   N24211  UA     
## 3  2013     1     1     5 JFK    MIA   N619AA  AA     
## 4  2013     1     1     5 JFK    BQN   N804JB  B6     
## 5  2013     1     1     6 LGA    ATL   N668DN  DL     
## 6  2013     1     1     5 EWR    ORD   N39463  UA

# Add the full airline number
flights2 %>%
  select(-origin, - dest) %>%
  left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

There are different types of mutating joins. First is the inner join that matches pairs of observations whenever the keys are equal and keeps observations that appear in both tables.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

x
## # A tibble: 3 x 2
##     key val_x
##   <dbl> <chr>
## 1     1 x1   
## 2     2 x2   
## 3     3 x3
y
## # A tibble: 3 x 2
##     key val_y
##   <dbl> <chr>
## 1     1 y1   
## 2     2 y2   
## 3     4 y3

# Inner join
x %>% 
  inner_join(y, by = "key")
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2

Outer joins keep observations that appear in at least one of the tables. A left join keeps all observations in x, a right join keeps all observations in y, and a full join keeps all observations of x and y.

# Left join
x %>% left_join(y, by = "key")
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA>

# Right join
x %>% right_join(y, by = "key")
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     4 <NA>  y3

# Full join
x %>% full_join(y, by = "key")
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2   
## 3     3 x3    <NA> 
## 4     4 <NA>  y3

If you don't define the key (i.e., the by argument) then it uses all column names that match to generate the key.

# Joining by year, month, day, hour, and origin
flights2 %>% 
  left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

If the key doesn't have matching column names, you can used a named character vector, e.g., by = c("a" = "b") which will match x$a to y$b.

head(flights2)
## # A tibble: 6 x 8
##    year month   day  hour origin dest  tailnum carrier
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
## 1  2013     1     1     5 EWR    IAH   N14228  UA     
## 2  2013     1     1     5 LGA    IAH   N24211  UA     
## 3  2013     1     1     5 JFK    MIA   N619AA  AA     
## 4  2013     1     1     5 JFK    BQN   N804JB  B6     
## 5  2013     1     1     6 LGA    ATL   N668DN  DL     
## 6  2013     1     1     5 EWR    ORD   N39463  UA
# Get lat and lon information for the airport based on destination airport
flights2 %>% 
  left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Geor~  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Geor~  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam~  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>   NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart~  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chic~  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort~  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Wash~  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orla~  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chic~  42.0 -87.9   668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

# Now based on origin airport
flights2 %>% 
  left_join(airports, c("origin" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Newa~  40.7 -74.2    18
##  2  2013     1     1     5 LGA    IAH   N24211  UA      La G~  40.8 -73.9    22
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      John~  40.6 -73.8    13
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      John~  40.6 -73.8    13
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      La G~  40.8 -73.9    22
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Newa~  40.7 -74.2    18
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Newa~  40.7 -74.2    18
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      La G~  40.8 -73.9    22
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      John~  40.6 -73.8    13
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La G~  40.8 -73.9    22
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

Filtering joins match observations similar to mutating joins, but affect the observations, not the variables. A semi_join(x, y) keeps all observations in x that have a match in y. An anti_join(x, y) drops all observations in x that have a match in y.

# Top destinations
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest
## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705


# Now find each flight that went to one of the top destinations
flights %>% 
  semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>


# We can find out how many flights don't have a match in planes
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ... with 712 more rows