Strings are a special type of data. They are always classified as a character. When working with strings it is common to use regular expressions, or regexps. Strings usually contain unstructured or semi-structured data, and regexps are a concise language to describe patterns in strings. We will focus on the stringer package that is included in the core tidyverse.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Make a string with "" or ''
(string1 <- "This is a string")
## [1] "This is a string"
(string2 <- 'If I want to include a "quote" inside a string, I use single quotes')
## [1] "If I want to include a \"quote\" inside a string, I use single quotes"
(string3 <- "There are also many special \ncharacters in R, like \u00b5")
## [1] "There are also many special \ncharacters in R, like µ"
# See more with ?'"'
# String length
str_length(c("a", "R for data science", NA))
## [1] 1 18 NA
# Combining strings
str_c("x", "yz")
## [1] "xyz"
# Control how they are seperated
str_c("x", "yz", sep = ", ")
## [1] "x, yz"
# More complex
str_c("prefix-", c("a", "b", "c"), "-suffix")
## [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
# Extract parts of a string
x <- c("Apple", "Banana", "Orange")
str_sub(x, start = 1, end = 3)
## [1] "App" "Ban" "Ora"
str_sub(x, start = -3, end = -1)
## [1] "ple" "ana" "nge"
# You can also use the subest function to alter the text at a specific location
str_to_lower(str_sub(x, 1, 1))
## [1] "a" "b" "o"
Regular expressions are complicated, but can be useful once you understand how to use them. We will use str_view() and str_view_all() to show how a character vector and regular expression match up.
We will start with some basic matching.
# Matching exact letters
x <- c("apple", "banana", "pear")
str_view(x, "an")
# If you want to match any character, you use . (dot)
str_view(x, ".a.")
But what if you want to match "." in a string, such as x.y. We need to use the backslash \ to escape special behavior. Unfortunately, \ also has special behavior, therefore to match to "." in a string, you need to write \\.
# Matching to special behaviors, the dot
x <- c("sats", "a.x", "hat")
str_view(x, "a.")
str_view(x, "a\\.")
str_view(x, "a\\..")
# For a single backslash
x <- "a\\b"
writeLines(x)
## a\b
str_view(x, "\\\\")
Sometimes you might want to match expressions at the beginning or end of a string. ^ will match to the beginning of the string, and $ will match to the end.
# Match to the beginning
x <- c("apple", "banana", "pear")
str_view(x, "^a")
# Match to the end
x <- c("apple", "banana", "pear")
str_view(x, "a$")
# Match a complete string by anchoring the beginning and end
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
We have already seen a special character, ., that matches any character. There are 4 more that are useful in R:
\d matches any digit\s matches any white space, such as space, tab, newline[abc] matches a, b, OR c[^abc] matches anything except a, b, OR cx <- c("apple123", "a4_pie3", "ab*cat", "a.c", "a c")
str_view(x, "\\d")
str_view(x, "\\s")
str_view(x, "[abc]")
str_view(x, "[^abc]")
# You can also contain some special characters, like . , inside [] to match them
str_view(x, "a[.]c")
str_view(x, ".[*].")
str_view(x, "a[ ]")
You can also pick between one or more alternative patterns. For example, "app|ear" to either "app" or "ear". You can also use () to make it clear what you want to match.
str_view(c("grey", "gray"), "gr(e|a)y")
The next level up in complexity is controlling how many times a pattern matches. ? will match to 0 or 1, + will match to 1 or more, and * will match to 0 or more.
x <- "MDCCCLXXXVIII"
str_view(x, "CC")
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, "C[LX]")
str_view(x, "C+[LX]")
str_view(x, "C[LX]+")
# You can also match to a specific number of cases
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
# Laxy matching, i.e., the shortest string
str_view(x, "C{2,3}?")
str_view(x, 'C[LX]+?')
Parentheses can also be used to create a numbered capturing group. A capturing group stores the part of the string that was matched by the regular expression inside the parentheses. You can then refer to these groups with back references.
# This will find all fruits that have a pair of repeated letters
# (..) finds all two letter combinations, \\1 is saying look at the two letter
# combinations, and by combining it you are telling R to match to 2 letters AND
# the back reference
str_view(fruit, "(..)\\1", match = TRUE)
Now that you have learned the basics of strings and regular expressions, it is time to apply what you have learned to manipulate data by determining which strings match a pattern, find position of matches, extract or replace matches, and split a string based on a match.
It is relatively easy to determine if characters in a vector matches a pattern using str_detect().
x <- c("apple", "banana", "pear")
str_detect(x, "e")
## [1] TRUE FALSE TRUE
# You can also combine with sum() or mean() to answer questions about a large vector
# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306
# You can also negate the function to do the opposite
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
# These are the same thing, but the first is way easier
identical(no_vowels_1, no_vowels_2)
## [1] TRUE
# Sometimes you might want to know the number of matches in a string
str_count(x, "a")
## [1] 1 3 1
# Or the average number of vowels per word
mean(str_count(words, "[aeiou]"))
## [1] 1.991837
str_detect() is an easy way to select elements that match a pattern, but an easier way to subset is the str_subset() wrapper:
# Pull out words that end in x
str_subset(words, "x$")
## [1] "box" "sex" "six" "tax"
# When you are working in a data frame you will want to filter instead
df <- data.frame(i = seq_along(words), word = words)
df %>% filter(str_detect(word, "x$"))
## i word
## 1 108 box
## 2 747 sex
## 3 772 six
## 4 841 tax
You should note that matches never overlap. For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three:
str_count("abababa", "aba")
## [1] 2
str_view_all("abababa", "aba")
To extract the actual text of a match, we use str_extract()
# Look at sentences provided in the stringer package
length(sentences)
## [1] 720
head(sentences)
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
## [5] "Rice is often served in round bowls."
## [6] "The juice of lemons makes fine punch."
# If we want to find all sentences that contain a color, we can create a vector
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
# Then turn it into a regexp using str_c()
colour_match <- str_c(colours, collapse = "|")
colour_match
## [1] "red|orange|yellow|green|blue|purple"
# Now we can select sentences that contain a color and extract the color
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
## [1] "blue" "blue" "red" "red" "red" "blue"
# Note that it only extracts the first match. We can get all matches by using
# str_extract_all()
matches <- str_extract_all(has_colour, colour_match)
We can also worked with grouped strings to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the” (i.e., "a|the ". Defining a “word” in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character "+" that isn’t a space "[^ ]".
# Define our pattern using () to group them
noun <- "(a|the) ([^ ]+)"
# Get the first 10 sentences that have matches
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
# Extract the matches
has_noun %>%
str_extract(noun)
## [1] "the smooth" "the sheet" "the depth" "a chicken" "the parked"
## [6] "the sun" "the huge" "the ball" "the woman" "a helps"
# Get each individual component
has_noun %>%
str_match(noun)
## [,1] [,2] [,3]
## [1,] "the smooth" "the" "smooth"
## [2,] "the sheet" "the" "sheet"
## [3,] "the depth" "the" "depth"
## [4,] "a chicken" "a" "chicken"
## [5,] "the parked" "the" "parked"
## [6,] "the sun" "the" "sun"
## [7,] "the huge" "the" "huge"
## [8,] "the ball" "the" "ball"
## [9,] "the woman" "the" "woman"
## [10,] "a helps" "a" "helps"
You can also replace matches with new strings using str_replace() and str_replace_all().
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
## [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-" "p--r" "b-n-n-"
# You can perform multiple replacements by supplying a vector
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house" "two cars" "three people"
# You can also use backreferences to insert components of the match. Flip order of
# second and third words
sentences %>%
head(5)
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
## [5] "Rice is often served in round bowls."
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
## [1] "The canoe birch slid on the smooth planks."
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."
## [4] "These a days chicken leg is a rare dish."
## [5] "Rice often is served in round bowls."
Another useful function is str_split(), which splits a string up into pieces.
# Split a sentence into words
sentences %>%
head(5) %>%
str_split(" ")
## [[1]]
## [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
## [8] "planks."
##
## [[2]]
## [1] "Glue" "the" "sheet" "to" "the"
## [6] "dark" "blue" "background."
##
## [[3]]
## [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
##
## [[4]]
## [1] "These" "days" "a" "chicken" "leg" "is" "a"
## [8] "rare" "dish."
##
## [[5]]
## [1] "Rice" "is" "often" "served" "in" "round" "bowls."
# This returns a list, with each sentence as a list component. You can return a
# matrix instead
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
## [2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background."
## [3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a"
## [4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare"
## [5,] "Rice" "is" "often" "served" "in" "round" "bowls." ""
## [,9]
## [1,] ""
## [2,] ""
## [3,] "well."
## [4,] "dish."
## [5,] ""
# You can request a max number of pieces
sentences %>%
head(5) %>%
str_split(" ", n = 2, simplify = TRUE)
## [,1] [,2]
## [1,] "The" "birch canoe slid on the smooth planks."
## [2,] "Glue" "the sheet to the dark blue background."
## [3,] "It's" "easy to tell the depth of a well."
## [4,] "These" "days a chicken leg is a rare dish."
## [5,] "Rice" "is often served in round bowls."
When dealing with data that has been entered in excel, you will likely come across inconsistent nomenclature. The simplest case of this can be differences in use of upper or lowercase. Regex takes other comments that can help solve some of these problems.
# You can ignore capitalization
bs <- c("banana", "Banana", "BANANA")
str_view_all(bs, "banana")
str_view_all(bs, regex("banana", ignore_case = TRUE))
# You can match a string to file names. dir() lists all files in a directory
dir(pattern = "\\.R$")
## character(0)
With spatial data you will undoubtedly encounter date/time data. These can be a headache to work with, and there are a few different packages to select from. We will be working in lubridate.
There are 3 different types of date/time data: date, time, and date-time (also known as POSIXct). There are a few ways in which you might create date/time data: from strings, from individual date-time components, or from a date/time object.
When you load an excel sheet into R with date/time data, these are often strings. Lubridate has helper functions to parse strings into date/time data. They automatically work out the date format once you specify the order.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(nycflights13)
# Year, Month, Day order
ymd("2017-01-31")
## [1] "2017-01-31"
# Month, Day, Year order
mdy("January 31st, 2017")
## [1] "2017-01-31"
# Day, Month, Year order
dmy("31-Jan-2017")
## [1] "2017-01-31"
# For date-time, add an underscore with h, m, s
ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"
When you work outside of the lubridate package, you may need to specify your date-time format. The format is constructed from the following pieces: Year %Y (4 digits). %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999. Month %m (2 digits). %b (abbreviated name, like “Jan”). %B (full name, “January”). Day %d (2 digits). %e (optional leading space). Time %H 0-23 hour. %I 0-12, must be used with %p. %p AM/PM indicator. %M minutes. %S integer seconds. %OS real seconds. %Z Time zone (as name, e.g. America/Chicago).
parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"
Instead of a string, you might have individual components spread across multiple columns.
flights %>%
select(year, month, day, hour, minute)
## # A tibble: 336,776 x 5
## year month day hour minute
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 5 15
## 2 2013 1 1 5 29
## 3 2013 1 1 5 40
## 4 2013 1 1 5 45
## 5 2013 1 1 6 0
## 6 2013 1 1 5 58
## 7 2013 1 1 6 0
## 8 2013 1 1 6 0
## 9 2013 1 1 6 0
## 10 2013 1 1 6 0
## # ... with 336,766 more rows
To create a date/time from this, we use make_date() for dates, or make_datetime() for date-times.
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5 15 2013-01-01 05:15:00
## 2 2013 1 1 5 29 2013-01-01 05:29:00
## 3 2013 1 1 5 40 2013-01-01 05:40:00
## 4 2013 1 1 5 45 2013-01-01 05:45:00
## 5 2013 1 1 6 0 2013-01-01 06:00:00
## 6 2013 1 1 5 58 2013-01-01 05:58:00
## 7 2013 1 1 6 0 2013-01-01 06:00:00
## 8 2013 1 1 6 0 2013-01-01 06:00:00
## 9 2013 1 1 6 0 2013-01-01 06:00:00
## 10 2013 1 1 6 0 2013-01-01 06:00:00
## # ... with 336,766 more rows
You may also want to switch between dates and date-times.
today()
## [1] "2021-01-26"
as_datetime(today())
## [1] "2021-01-26 UTC"
now()
## [1] "2021-01-26 09:57:54 CST"
as_date(now())
## [1] "2021-01-26"
Once you have date/time data in R, you can pull out individual parts with assessor functions like year() or wday() (day of week).
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime)
## [1] 6
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
# You can also set the components with these functions
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"
month(datetime) <- 01
datetime
## [1] "2020-01-08 12:34:56 UTC"
hour(datetime) <- hour(datetime) + 1
datetime
## [1] "2020-01-08 13:34:56 UTC"
# More efficiently, use update() to set multiple values at once
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "2020-02-02 02:34:56 UTC"
Sometimes you may need to round a date to a nearby unit of time.
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
# Make a new DF to hold our date-time data
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
dplyr::mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
dplyr::select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt %>% head()
## # A tibble: 6 x 9
## origin dest dep_delay arr_delay dep_time sched_dep_time
## <chr> <chr> <dbl> <dbl> <dttm> <dttm>
## 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
## 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
## 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
## 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
## 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
## 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
## # ... with 3 more variables: arr_time <dttm>, sched_arr_time <dttm>,
## # air_time <dbl>
flights_dt %>%
count(weekf = floor_date(dep_time, "week"), weekc = ceiling_date(dep_time, "week")) %>%
mutate(diffweek = weekc - weekf)
## # A tibble: 56 x 4
## weekf weekc n diffweek
## <dttm> <dttm> <int> <drtn>
## 1 2012-12-30 00:00:00 2013-01-06 00:00:00 4300 604800 secs
## 2 2013-01-06 00:00:00 2013-01-13 00:00:00 6082 604800 secs
## 3 2013-01-13 00:00:00 2013-01-20 00:00:00 5976 604800 secs
## 4 2013-01-20 00:00:00 2013-01-27 00:00:00 5925 604800 secs
## 5 2013-01-27 00:00:00 2013-02-03 00:00:00 5774 604800 secs
## 6 2013-02-03 00:00:00 2013-02-10 00:00:00 5160 604800 secs
## 7 2013-02-10 00:00:00 2013-02-17 00:00:00 6081 604800 secs
## 8 2013-02-17 00:00:00 2013-02-24 00:00:00 6253 604800 secs
## 9 2013-02-24 00:00:00 2013-03-03 00:00:00 6275 604800 secs
## 10 2013-03-03 00:00:00 2013-03-10 00:00:00 6027 604800 secs
## # ... with 46 more rows
The next thing we need to understand and work with are time spans. Often you may need to figure out how many days, weeks, hours, etc. have passed, or you may want to filter out dates that do not fall within a certain period or interval.
The duration of a time span represents an exact number of seconds. When you subtract two dates in R, you get a difftime object.
# How old am I today?
(my_age <- today() - ymd(19881122))
## Time difference of 11753 days
str(my_age)
## 'difftime' num 11753
## - attr(*, "units")= chr "days"
# Lubradate always reports duration in seconds
as.duration(my_age)
## [1] "1015459200s (~32.18 years)"
# Create a duration in seconds
dseconds(15)
## [1] "15s"
# In minutes...
dminutes(10)
## [1] "600s (~10 minutes)"
# All the way up to years
dyears(c(1,2))
## [1] "31557600s (~1 years)" "63115200s (~2 years)"
# Use basic math with durations
2 * dyears(1)
## [1] "63115200s (~2 years)"
today() + ddays(10)
## [1] "2021-02-05"
The problem is because duration represent an exact number of seconds, sometimes you can get unexpected results. Because of daylight savings time some days only have 23 hours or have up to 25 hours. Instead, you can use periods in lubridate. Periods are spans of time that don't have a fixed length in seconds. Instead, we can work in days, months, or years.
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"
# Now work in periods
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"
# Again, you can create periods
seconds(12)
## [1] "12S"
# All the way up to years
years(3)
## [1] "3y 0m 0d 0H 0M 0S"
# You can add and multiply
10 * (months(6) + days(1))
## [1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
## [1] "50d 25H 2M 0S"
The last time span you need to understand is an interval. What should years(1) / days(1) return? It will either return 365 or 366, depending on if it is a leap year.
years(1) / days(1)
## [1] 365.25
If you want a more accurate measurement, you will have to use an interval and set the starting point. Intervals in lubridate are denoted by start %--% end.
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 365
# Another option
interval(start = today(), end = next_year) / ddays(1)
## [1] 365
# We can set an interval
(my_interval <- interval(start = today(), end = next_year))
## [1] 2021-01-26 UTC--2022-01-26 UTC
# We can flip it
int_flip(my_interval)
## [1] 2022-01-26 UTC--2021-01-26 UTC
# Shift it
int_shift(my_interval, ddays(20))
## [1] 2021-02-15 UTC--2022-02-15 UTC
# Find out how many days in each interval using integer division
(today() %--% next_year) %/% days(1)
## [1] 365
# You can also take a vector of date-time and get the interval between each element
head(flights_dt$dep_time)
## [1] "2013-01-01 05:17:00 UTC" "2013-01-01 05:33:00 UTC"
## [3] "2013-01-01 05:42:00 UTC" "2013-01-01 05:44:00 UTC"
## [5] "2013-01-01 05:54:00 UTC" "2013-01-01 05:54:00 UTC"
head(int_diff(flights_dt$dep_time))
## [1] 2013-01-01 05:17:00 UTC--2013-01-01 05:33:00 UTC
## [2] 2013-01-01 05:33:00 UTC--2013-01-01 05:42:00 UTC
## [3] 2013-01-01 05:42:00 UTC--2013-01-01 05:44:00 UTC
## [4] 2013-01-01 05:44:00 UTC--2013-01-01 05:54:00 UTC
## [5] 2013-01-01 05:54:00 UTC--2013-01-01 05:54:00 UTC
## [6] 2013-01-01 05:54:00 UTC--2013-01-01 05:55:00 UTC
The last piece of information we need about Date/time data is time zones. R uses IANA time zones which use a consistent naming scheme. You can find out what timezone you are currently in using Sys.timezone(). You can change what time zone is displayed or what time zone is used.
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
## [1] "2015-06-01 12:00:00 EDT"
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
## [1] "2015-06-01 18:00:00 CEST"
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
## [1] "2015-06-02 04:00:00 NZST"
x4 <- c(x1, x2, x3)
# Change what is displayed
x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
x4a
## [1] "2015-06-02 02:30:00 +1030" "2015-06-02 02:30:00 +1030"
## [3] "2015-06-02 02:30:00 +1030"
x4a - x4
## Time differences in secs
## [1] 0 0 0
# Change the underlying time
x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
x4b
## [1] "2015-06-01 12:00:00 +1030" "2015-06-01 12:00:00 +1030"
## [3] "2015-06-01 12:00:00 +1030"
x4b - x4
## Time differences in hours
## [1] -14.5 -14.5 -14.5
Once you have imported data it is sometimes necessary to arrange it in a certain order or reshape it from long to wide format. Although there are base functions in R that can arrange variables, we will be working exclusively with dplyr and tidyr.
arrange() orders observations according to the values for a specific variable. The default is to order numbers from lowest to highest.
library(dplyr)
library(tidyverse)
# First we will look at our example database
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
# We will order our data by year and highway mpg (hwy)
arrange(mpg, year, hwy)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet k1500 tah~ 5.7 1999 8 auto~ 4 11 15 r suv
## 2 dodge dakota pi~ 5.2 1999 8 auto~ 4 11 15 r pick~
## 3 dodge durango 4~ 5.9 1999 8 auto~ 4 11 15 r suv
## 4 dodge ram 1500 ~ 5.2 1999 8 auto~ 4 11 15 r pick~
## 5 dodge ram 1500 ~ 5.9 1999 8 auto~ 4 11 15 r pick~
## 6 ford f150 pick~ 5.4 1999 8 auto~ 4 11 15 r pick~
## 7 land rover range rov~ 4 1999 8 auto~ 4 11 15 p suv
## 8 land rover range rov~ 4.6 1999 8 auto~ 4 11 15 p suv
## 9 toyota land crui~ 4.7 1999 8 auto~ 4 11 15 r suv
## 10 dodge durango 4~ 5.2 1999 8 auto~ 4 11 16 r suv
## # ... with 224 more rows
# This is the same as above
mpg %>% arrange(year, hwy)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet k1500 tah~ 5.7 1999 8 auto~ 4 11 15 r suv
## 2 dodge dakota pi~ 5.2 1999 8 auto~ 4 11 15 r pick~
## 3 dodge durango 4~ 5.9 1999 8 auto~ 4 11 15 r suv
## 4 dodge ram 1500 ~ 5.2 1999 8 auto~ 4 11 15 r pick~
## 5 dodge ram 1500 ~ 5.9 1999 8 auto~ 4 11 15 r pick~
## 6 ford f150 pick~ 5.4 1999 8 auto~ 4 11 15 r pick~
## 7 land rover range rov~ 4 1999 8 auto~ 4 11 15 p suv
## 8 land rover range rov~ 4.6 1999 8 auto~ 4 11 15 p suv
## 9 toyota land crui~ 4.7 1999 8 auto~ 4 11 15 r suv
## 10 dodge durango 4~ 5.2 1999 8 auto~ 4 11 16 r suv
## # ... with 224 more rows
# Notice this did not change our actual data, that is because we did not save it
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
# If we want our new order to stay, we can save it as such
mpg <- mpg %>% arrange(year, hwy)
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet k1500 tah~ 5.7 1999 8 auto(~ 4 11 15 r suv
## 2 dodge dakota pi~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 3 dodge durango 4~ 5.9 1999 8 auto(~ 4 11 15 r suv
## 4 dodge ram 1500 ~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 5 dodge ram 1500 ~ 5.9 1999 8 auto(~ 4 11 15 r pick~
## 6 ford f150 pick~ 5.4 1999 8 auto(~ 4 11 15 r pick~
# What about characters? Let's look
mpg %>% arrange(manufacturer, model)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 2.8 1999 6 auto(l~ f 16 26 p comp~
## 2 audi a4 2.8 1999 6 manual~ f 18 26 p comp~
## 3 audi a4 1.8 1999 4 auto(l~ f 18 29 p comp~
## 4 audi a4 1.8 1999 4 manual~ f 21 29 p comp~
## 5 audi a4 3.1 2008 6 auto(a~ f 18 27 p comp~
## 6 audi a4 2 2008 4 auto(a~ f 21 30 p comp~
## 7 audi a4 2 2008 4 manual~ f 20 31 p comp~
## 8 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
## 9 audi a4 quat~ 2.8 1999 6 auto(l~ 4 15 25 p comp~
## 10 audi a4 quat~ 2.8 1999 6 manual~ 4 17 25 p comp~
## # ... with 224 more rows
# We can also reverse the order with desc()
mpg %>% arrange(manufacturer, desc(model))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quat~ 2.8 1999 6 auto(l~ 4 15 24 p mids~
## 2 audi a6 quat~ 4.2 2008 8 auto(s~ 4 16 23 p mids~
## 3 audi a6 quat~ 3.1 2008 6 auto(s~ 4 17 25 p mids~
## 4 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
## 5 audi a4 quat~ 2.8 1999 6 auto(l~ 4 15 25 p comp~
## 6 audi a4 quat~ 2.8 1999 6 manual~ 4 17 25 p comp~
## 7 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p comp~
## 8 audi a4 quat~ 3.1 2008 6 auto(s~ 4 17 25 p comp~
## 9 audi a4 quat~ 3.1 2008 6 manual~ 4 15 25 p comp~
## 10 audi a4 quat~ 2 2008 4 auto(s~ 4 19 27 p comp~
## # ... with 224 more rows
Reshaping data is another important concept to grasp to be able to work effectively in R. Data can be organized in different ways, and some can be easier to use than others. Each data set below contains the same data and 4 variables (country, year, population size, # cases), but organized differently.
# All 4 variables have thier own column
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# The variables for population size and number of cases are now expressed as an
# observation
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
# The number of cases and population size are now combined into one variable, rate
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# The data are spread across two tibbles, one for # cases and the other for population
# size
table4a # cases
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b # population
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
Hadley Whickham described 3 rules to a tidy data set:
So which table above is the tidy one?
What do you do with an untidy data set? Fortunately, the tidyverse has many built in functions to help you tidy up your data set. The first step is to figure out what the variables and observations are. Then reshaping the data is simple with pivot_longer() and pivot_wider.
A common problem in some data sets is that the column name is actually the value of a variable. table4a, for example, has the column names that represent values of the variable year. Each row also contains 2 observations, not one.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
We can fix this by pivoting the columns into a new pair of variables. To do this, we need the columns to be pivoted, the name of the new variable to move the column names to, and the name of the variable that will hold the values from the columns.
table4a %>%
pivot_longer(cols = c('1999', '2000'), names_to = "Year", values_to = "Cases")
## # A tibble: 6 x 3
## country Year Cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
This makes the new data set longer. You have added more rows. pivot_wider() is the opposite, it will make a wider data set. You use it when an observation is scattered across multiple rows. In table2 a single observation is a country for each year, but each observation is spread across two rows. The country and year combination (i.e., Brazil 1999) is repeated twice.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
To tidy this up, we need the column to take the new variable names from and the column to take the values from.
table2 %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
These are very simple cases, but pivot_longer() and pivot_wider() are easy to use even in complex situations.
Lets take a look at a hypothetical encounter history for nest surveys of wild turkey. Lets say we have 25 nests we were following, and each day we visited the nest we recorded whether or not the nest was still active (fate = 1) or had failed (fate = 0). We also took some vegetation measurements at each nest, and recorded the age of the nesting female. What are our variables? We need nest ID to keep track of the 25 nests, fate of the nest, and day. We also have age of the nesting female, max vegetation height, and ground cover. Each nest and day has an observation, but the veg metrics don't change. Let's be real, we probably did not collect and store our data in a tidy format. That is because tidy data isn't necessarily the easiest to enter the data. I created the hypothetical data in excel, loaded it into R, then used dput() to get code to recreate the data.
# Our hypothetical data
nest_data <- structure(list(Nest_ID = 1:25, veg_max = c(14L, 36L, 37L, 83L,
63L, 42L, 173L, 217L, 135L, 15L, 35L, 7L, 67L, 3L, 37L, 3L, 73L,
135L, 73L, 84L, 8L, 9L, 89L, 157L, 162L), gc = c(0.23, 0.86,
0.1, 0.4, 0.2, 0.52, 0.84, 0.26, 0.56, 0.37, 0.25, 0.84, 0.36,
0.85, 0.25, 0.64, 0.37, 0.26, 0.36, 0.72, 0.54, 0.35, 0.25, 0.74,
0.63), age = c("adult", "adult", "adult", "adult", "adult", "adult",
"juvenile", "juvenile", "adult", "adult", "juvenile", "juvenile",
"adult", "adult", "adult", "adult", "juvenile", "adult", "adult",
"juvenile", "juvenile", "adult", "adult", "adult", "juvenile"
), Day1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Day2 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Day3 = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L, 1L, 1L), Day4 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L
), Day5 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), Day6 = c(1L,
1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), Day7 = c(1L, 1L, 1L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L, 1L, 1L), Day8 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L
), Day9 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day10 = c(1L,
1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day11 = c(1L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L), Day12 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day13 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day14 = c(1L,
1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day15 = c(1L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L), Day16 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day17 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day18 = c(1L,
0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day19 = c(1L, 0L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L), Day20 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day21 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L,
0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day22 = c(1L,
0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day23 = c(1L, 0L, 1L, 0L, 1L,
0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L), Day24 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L,
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L
), Day25 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L,
0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L), Day26 = c(1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L), Day27 = c(1L, 0L, 1L, 0L, 1L,
0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 0L), Day28 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L,
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L
)), class = "data.frame", row.names = c(NA, -25L))
Now we can look at the data. We know that it is not tidy. The column names are actually the values of the variable, day. We can fix that with pivot_longer.
head(nest_data)
## Nest_ID veg_max gc age Day1 Day2 Day3 Day4 Day5 Day6 Day7 Day8 Day9 Day10
## 1 1 14 0.23 adult 1 1 1 1 1 1 1 1 1 1
## 2 2 36 0.86 adult 1 1 1 1 1 1 1 1 1 1
## 3 3 37 0.10 adult 1 1 1 1 1 1 1 1 1 1
## 4 4 83 0.40 adult 1 1 1 1 0 0 0 0 0 0
## 5 5 63 0.20 adult 1 1 1 1 1 1 1 1 1 1
## 6 6 42 0.52 adult 1 1 1 1 1 1 1 1 1 1
## Day11 Day12 Day13 Day14 Day15 Day16 Day17 Day18 Day19 Day20 Day21 Day22 Day23
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 0 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 0
## Day24 Day25 Day26 Day27 Day28
## 1 1 1 1 1 1
## 2 0 0 0 0 0
## 3 1 1 1 1 1
## 4 0 0 0 0 0
## 5 1 1 1 1 1
## 6 0 0 0 0 0
str(nest_data)
## 'data.frame': 25 obs. of 32 variables:
## $ Nest_ID: int 1 2 3 4 5 6 7 8 9 10 ...
## $ veg_max: int 14 36 37 83 63 42 173 217 135 15 ...
## $ gc : num 0.23 0.86 0.1 0.4 0.2 0.52 0.84 0.26 0.56 0.37 ...
## $ age : chr "adult" "adult" "adult" "adult" ...
## $ Day1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day2 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day3 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day4 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day5 : int 1 1 1 0 1 1 1 1 1 1 ...
## $ Day6 : int 1 1 1 0 1 1 1 1 1 1 ...
## $ Day7 : int 1 1 1 0 1 1 1 1 1 1 ...
## $ Day8 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day9 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day10 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day11 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day12 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day13 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day14 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day15 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day16 : int 1 1 1 0 1 1 1 0 1 1 ...
## $ Day17 : int 1 0 1 0 1 1 1 0 1 1 ...
## $ Day18 : int 1 0 1 0 1 1 1 0 1 1 ...
## $ Day19 : int 1 0 1 0 1 1 1 0 1 1 ...
## $ Day20 : int 1 0 1 0 1 1 1 0 1 1 ...
## $ Day21 : int 1 0 1 0 1 1 1 0 1 1 ...
## $ Day22 : int 1 0 1 0 1 1 1 0 1 1 ...
## $ Day23 : int 1 0 1 0 1 0 1 0 1 1 ...
## $ Day24 : int 1 0 1 0 1 0 1 0 1 1 ...
## $ Day25 : int 1 0 1 0 1 0 1 0 1 1 ...
## $ Day26 : int 1 0 1 0 1 0 1 0 1 1 ...
## $ Day27 : int 1 0 1 0 1 0 1 0 1 1 ...
## $ Day28 : int 1 0 1 0 1 0 1 0 1 1 ...
nest_data %>%
pivot_longer(cols = 5:32, names_to = "day", values_to = "fate",
names_prefix = "Day")
## # A tibble: 700 x 6
## Nest_ID veg_max gc age day fate
## <int> <int> <dbl> <chr> <chr> <int>
## 1 1 14 0.23 adult 1 1
## 2 1 14 0.23 adult 2 1
## 3 1 14 0.23 adult 3 1
## 4 1 14 0.23 adult 4 1
## 5 1 14 0.23 adult 5 1
## 6 1 14 0.23 adult 6 1
## 7 1 14 0.23 adult 7 1
## 8 1 14 0.23 adult 8 1
## 9 1 14 0.23 adult 9 1
## 10 1 14 0.23 adult 10 1
## # ... with 690 more rows
You see that our data set is indeed longer. It duplicated the values for veg_max and others that did not change for each nest. We also used names_prefix to remove the word Day in front of each day number. But what if we measured veg_max for each visit?
nest_data <- nest_data %>%
pivot_longer(cols = 5:32, names_to = "day", values_to = "fate",
names_prefix = "Day") %>%
mutate(veg_max = round(runif(n(), 0, 250), 0))
head(nest_data)
## # A tibble: 6 x 6
## Nest_ID veg_max gc age day fate
## <int> <dbl> <dbl> <chr> <chr> <int>
## 1 1 16 0.23 adult 1 1
## 2 1 185 0.23 adult 2 1
## 3 1 249 0.23 adult 3 1
## 4 1 55 0.23 adult 4 1
## 5 1 106 0.23 adult 5 1
## 6 1 201 0.23 adult 6 1
# We will make our data set wider, then longer again
nest_data <- nest_data %>%
pivot_wider(names_from = day, names_prefix = "Day",
values_from = c(fate, veg_max))
# This is truly hideous... but it is good practice
nest_data
## # A tibble: 25 x 59
## Nest_ID gc age fate_Day1 fate_Day2 fate_Day3 fate_Day4 fate_Day5
## <int> <dbl> <chr> <int> <int> <int> <int> <int>
## 1 1 0.23 adult 1 1 1 1 1
## 2 2 0.86 adult 1 1 1 1 1
## 3 3 0.1 adult 1 1 1 1 1
## 4 4 0.4 adult 1 1 1 1 0
## 5 5 0.2 adult 1 1 1 1 1
## 6 6 0.52 adult 1 1 1 1 1
## 7 7 0.84 juve~ 1 1 1 1 1
## 8 8 0.26 juve~ 1 1 1 1 1
## 9 9 0.56 adult 1 1 1 1 1
## 10 10 0.37 adult 1 1 1 1 1
## # ... with 15 more rows, and 51 more variables: fate_Day6 <int>,
## # fate_Day7 <int>, fate_Day8 <int>, fate_Day9 <int>, fate_Day10 <int>,
## # fate_Day11 <int>, fate_Day12 <int>, fate_Day13 <int>, fate_Day14 <int>,
## # fate_Day15 <int>, fate_Day16 <int>, fate_Day17 <int>, fate_Day18 <int>,
## # fate_Day19 <int>, fate_Day20 <int>, fate_Day21 <int>, fate_Day22 <int>,
## # fate_Day23 <int>, fate_Day24 <int>, fate_Day25 <int>, fate_Day26 <int>,
## # fate_Day27 <int>, fate_Day28 <int>, veg_max_Day1 <dbl>, veg_max_Day2 <dbl>,
## # veg_max_Day3 <dbl>, veg_max_Day4 <dbl>, veg_max_Day5 <dbl>,
## # veg_max_Day6 <dbl>, veg_max_Day7 <dbl>, veg_max_Day8 <dbl>,
## # veg_max_Day9 <dbl>, veg_max_Day10 <dbl>, veg_max_Day11 <dbl>,
## # veg_max_Day12 <dbl>, veg_max_Day13 <dbl>, veg_max_Day14 <dbl>,
## # veg_max_Day15 <dbl>, veg_max_Day16 <dbl>, veg_max_Day17 <dbl>,
## # veg_max_Day18 <dbl>, veg_max_Day19 <dbl>, veg_max_Day20 <dbl>,
## # veg_max_Day21 <dbl>, veg_max_Day22 <dbl>, veg_max_Day23 <dbl>,
## # veg_max_Day24 <dbl>, veg_max_Day25 <dbl>, veg_max_Day26 <dbl>,
## # veg_max_Day27 <dbl>, veg_max_Day28 <dbl>
# Now we will tidy up this mess with some good helper functions from tidyverse
nest_data %>%
pivot_longer(cols = contains("Day"),
names_to = c(".value", "day"),
names_pattern = "(._?.*)_(.*)")
## # A tibble: 700 x 6
## Nest_ID gc age day fate veg_max
## <int> <dbl> <chr> <chr> <int> <dbl>
## 1 1 0.23 adult Day1 1 16
## 2 1 0.23 adult Day2 1 185
## 3 1 0.23 adult Day3 1 249
## 4 1 0.23 adult Day4 1 55
## 5 1 0.23 adult Day5 1 106
## 6 1 0.23 adult Day6 1 201
## 7 1 0.23 adult Day7 1 108
## 8 1 0.23 adult Day8 1 134
## 9 1 0.23 adult Day9 1 13
## 10 1 0.23 adult Day10 1 78
## # ... with 690 more rows
It is often necessary to extract an entire variable or set of variables or observations. This is easily achieved with filter() for observations or select() with variables.
filter() allows you to subset observations based on their value. We will work on the mpg data set included with the tidyverse packages.
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet k1500 tah~ 5.7 1999 8 auto(~ 4 11 15 r suv
## 2 dodge dakota pi~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 3 dodge durango 4~ 5.9 1999 8 auto(~ 4 11 15 r suv
## 4 dodge ram 1500 ~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 5 dodge ram 1500 ~ 5.9 1999 8 auto(~ 4 11 15 r pick~
## 6 ford f150 pick~ 5.4 1999 8 auto(~ 4 11 15 r pick~
# See what manufacturers we have
unique(mpg$manufacturer)
## [1] "chevrolet" "dodge" "ford" "land rover" "toyota"
## [6] "lincoln" "jeep" "mercury" "nissan" "volkswagen"
## [11] "audi" "subaru" "hyundai" "pontiac" "honda"
# Subset the data to include only Dodge
mpg %>%
filter(manufacturer == "dodge")
## # A tibble: 37 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 dodge dakota p~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 2 dodge durango ~ 5.9 1999 8 auto(~ 4 11 15 r suv
## 3 dodge ram 1500~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 4 dodge ram 1500~ 5.9 1999 8 auto(~ 4 11 15 r pick~
## 5 dodge durango ~ 5.2 1999 8 auto(~ 4 11 16 r suv
## 6 dodge ram 1500~ 5.2 1999 8 manua~ 4 11 16 r pick~
## 7 dodge dakota p~ 3.9 1999 6 auto(~ 4 13 17 r pick~
## 8 dodge dakota p~ 3.9 1999 6 manua~ 4 14 17 r pick~
## 9 dodge dakota p~ 5.2 1999 8 manua~ 4 11 17 r pick~
## 10 dodge durango ~ 3.9 1999 6 auto(~ 4 13 17 r suv
## # ... with 27 more rows
# Filter out all but audi
mpg %>%
filter(manufacturer != "audi")
## # A tibble: 216 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet k1500 tah~ 5.7 1999 8 auto~ 4 11 15 r suv
## 2 dodge dakota pi~ 5.2 1999 8 auto~ 4 11 15 r pick~
## 3 dodge durango 4~ 5.9 1999 8 auto~ 4 11 15 r suv
## 4 dodge ram 1500 ~ 5.2 1999 8 auto~ 4 11 15 r pick~
## 5 dodge ram 1500 ~ 5.9 1999 8 auto~ 4 11 15 r pick~
## 6 ford f150 pick~ 5.4 1999 8 auto~ 4 11 15 r pick~
## 7 land rover range rov~ 4 1999 8 auto~ 4 11 15 p suv
## 8 land rover range rov~ 4.6 1999 8 auto~ 4 11 15 p suv
## 9 toyota land crui~ 4.7 1999 8 auto~ 4 11 15 r suv
## 10 dodge durango 4~ 5.2 1999 8 auto~ 4 11 16 r suv
## # ... with 206 more rows
# We want to know which cars get at least 20 mpg in the city
mpg %>%
filter(cty >= 20)
## # A tibble: 56 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 subaru impreza~ 2.2 1999 4 auto(~ 4 21 26 r subco~
## 2 toyota camry 2.2 1999 4 auto(~ f 21 27 r midsi~
## 3 toyota camry s~ 2.2 1999 4 auto(~ f 21 27 r compa~
## 4 audi a4 1.8 1999 4 manua~ f 21 29 p compa~
## 5 honda civic 1.6 1999 4 manua~ f 23 29 p subco~
## 6 nissan altima 2.4 1999 4 manua~ f 21 29 r compa~
## 7 toyota camry 2.2 1999 4 manua~ f 21 29 r midsi~
## 8 toyota camry s~ 2.2 1999 4 manua~ f 21 29 r compa~
## 9 volkswagen gti 2 1999 4 manua~ f 21 29 r compa~
## 10 volkswagen jetta 2 1999 4 manua~ f 21 29 r compa~
## # ... with 46 more rows
filter() combines multiple arguments with "and", i.e., every expression must be true. If you want other combinations you can use Boolean operators (e.g., or is |).
# If we want Dodge or Ford manufacturers we use |
mpg %>%
filter(manufacturer == "dodge" | manufacturer == "ford")
## # A tibble: 62 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 dodge dakota p~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 2 dodge durango ~ 5.9 1999 8 auto(~ 4 11 15 r suv
## 3 dodge ram 1500~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 4 dodge ram 1500~ 5.9 1999 8 auto(~ 4 11 15 r pick~
## 5 ford f150 pic~ 5.4 1999 8 auto(~ 4 11 15 r pick~
## 6 dodge durango ~ 5.2 1999 8 auto(~ 4 11 16 r suv
## 7 dodge ram 1500~ 5.2 1999 8 manua~ 4 11 16 r pick~
## 8 ford f150 pic~ 4.6 1999 8 manua~ 4 13 16 r pick~
## 9 ford f150 pic~ 4.6 1999 8 auto(~ 4 13 16 r pick~
## 10 dodge dakota p~ 3.9 1999 6 auto(~ 4 13 17 r pick~
## # ... with 52 more rows
# This won't work, because a manufacuter cannot be both Dodge and Ford
mpg %>%
filter(manufacturer == "dodge", manufacturer == "ford")
## # A tibble: 0 x 11
## # ... with 11 variables: manufacturer <chr>, model <chr>, displ <dbl>,
## # year <int>, cyl <int>, trans <chr>, drv <chr>, cty <int>, hwy <int>,
## # fl <chr>, class <chr>
# Here is a shorthand. This will select every row where manufacturer is one of the
# values in the c() function
mpg %>%
filter(manufacturer %in% c("dodge", "ford"))
## # A tibble: 62 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 dodge dakota p~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 2 dodge durango ~ 5.9 1999 8 auto(~ 4 11 15 r suv
## 3 dodge ram 1500~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 4 dodge ram 1500~ 5.9 1999 8 auto(~ 4 11 15 r pick~
## 5 ford f150 pic~ 5.4 1999 8 auto(~ 4 11 15 r pick~
## 6 dodge durango ~ 5.2 1999 8 auto(~ 4 11 16 r suv
## 7 dodge ram 1500~ 5.2 1999 8 manua~ 4 11 16 r pick~
## 8 ford f150 pic~ 4.6 1999 8 manua~ 4 13 16 r pick~
## 9 ford f150 pic~ 4.6 1999 8 auto(~ 4 13 16 r pick~
## 10 dodge dakota p~ 3.9 1999 6 auto(~ 4 13 17 r pick~
## # ... with 52 more rows
# You can also randomly select a percentage or set number of rows
mpg %>%
sample_n(25, replace = FALSE)
## # A tibble: 25 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 3.1 2008 6 auto(~ f 18 27 p comp~
## 2 volkswagen gti 2 1999 4 manua~ f 21 29 r comp~
## 3 audi a4 1.8 1999 4 manua~ f 21 29 p comp~
## 4 land rover range ro~ 4 1999 8 auto(~ 4 11 15 p suv
## 5 chevrolet k1500 ta~ 5.7 1999 8 auto(~ 4 11 15 r suv
## 6 dodge dakota p~ 4.7 2008 8 auto(~ 4 14 19 r pick~
## 7 nissan pathfind~ 3.3 1999 6 manua~ 4 15 17 r suv
## 8 chevrolet c1500 su~ 5.3 2008 8 auto(~ r 14 20 r suv
## 9 volkswagen jetta 2.8 1999 6 auto(~ f 16 23 r comp~
## 10 ford f150 pic~ 4.6 1999 8 manua~ 4 13 16 r pick~
## # ... with 15 more rows
mpg %>%
sample_frac(0.10, replace = FALSE)
## # A tibble: 23 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 toyota toyota ~ 2.7 1999 4 manua~ 4 15 20 r pickup
## 2 dodge caravan~ 3.3 2008 6 auto(~ f 11 17 e miniv~
## 3 volkswagen new bee~ 2.5 2008 5 manua~ f 20 28 r subco~
## 4 dodge caravan~ 3.8 1999 6 auto(~ f 15 21 r miniv~
## 5 chevrolet c1500 s~ 5.7 1999 8 auto(~ r 13 17 r suv
## 6 ford f150 pi~ 4.6 2008 8 auto(~ 4 13 17 r pickup
## 7 nissan altima 2.4 1999 4 manua~ f 21 29 r compa~
## 8 dodge durango~ 5.9 1999 8 auto(~ 4 11 15 r suv
## 9 nissan altima 2.5 2008 4 manua~ f 23 32 r midsi~
## 10 audi a4 quat~ 2.8 1999 6 auto(~ 4 15 25 p compa~
## # ... with 13 more rows
Sometimes we might want to subset observations containing a particular sequence of letters. In our mpg data set, there are many different models that contain "4wd", more than we would want to write out. Instead, we can detect the sting using str_detect().
mpg %>%
filter(str_detect(model,'4wd'))
## # A tibble: 74 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet k1500 ta~ 5.7 1999 8 auto(~ 4 11 15 r suv
## 2 dodge dakota p~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 3 dodge durango ~ 5.9 1999 8 auto(~ 4 11 15 r suv
## 4 dodge ram 1500~ 5.2 1999 8 auto(~ 4 11 15 r pick~
## 5 dodge ram 1500~ 5.9 1999 8 auto(~ 4 11 15 r pick~
## 6 ford f150 pic~ 5.4 1999 8 auto(~ 4 11 15 r pick~
## 7 toyota land cru~ 4.7 1999 8 auto(~ 4 11 15 r suv
## 8 dodge durango ~ 5.2 1999 8 auto(~ 4 11 16 r suv
## 9 dodge ram 1500~ 5.2 1999 8 manua~ 4 11 16 r pick~
## 10 ford f150 pic~ 4.6 1999 8 manua~ 4 13 16 r pick~
## # ... with 64 more rows
It is not uncommon to get a large data set with variables that aren't needed for a specific analysis. select() allows you to extract certain variables for easy data sub setting.
# We can select certain variables by name
mpg %>%
select(model, displ, hwy)
## # A tibble: 234 x 3
## model displ hwy
## <chr> <dbl> <int>
## 1 k1500 tahoe 4wd 5.7 15
## 2 dakota pickup 4wd 5.2 15
## 3 durango 4wd 5.9 15
## 4 ram 1500 pickup 4wd 5.2 15
## 5 ram 1500 pickup 4wd 5.9 15
## 6 f150 pickup 4wd 5.4 15
## 7 range rover 4 15
## 8 range rover 4.6 15
## 9 land cruiser wagon 4wd 4.7 15
## 10 durango 4wd 5.2 16
## # ... with 224 more rows
# We can select all but named variables
mpg %>%
select(-model, -year, -hwy)
## # A tibble: 234 x 8
## manufacturer displ cyl trans drv cty fl class
## <chr> <dbl> <int> <chr> <chr> <int> <chr> <chr>
## 1 chevrolet 5.7 8 auto(l4) 4 11 r suv
## 2 dodge 5.2 8 auto(l4) 4 11 r pickup
## 3 dodge 5.9 8 auto(l4) 4 11 r suv
## 4 dodge 5.2 8 auto(l4) 4 11 r pickup
## 5 dodge 5.9 8 auto(l4) 4 11 r pickup
## 6 ford 5.4 8 auto(l4) 4 11 r pickup
## 7 land rover 4 8 auto(l4) 4 11 p suv
## 8 land rover 4.6 8 auto(l4) 4 11 p suv
## 9 toyota 4.7 8 auto(l4) 4 11 r suv
## 10 dodge 5.2 8 auto(l4) 4 11 r suv
## # ... with 224 more rows
# We can select by position
mpg %>%
select(1, 5, 6)
## # A tibble: 234 x 3
## manufacturer cyl trans
## <chr> <int> <chr>
## 1 chevrolet 8 auto(l4)
## 2 dodge 8 auto(l4)
## 3 dodge 8 auto(l4)
## 4 dodge 8 auto(l4)
## 5 dodge 8 auto(l4)
## 6 ford 8 auto(l4)
## 7 land rover 8 auto(l4)
## 8 land rover 8 auto(l4)
## 9 toyota 8 auto(l4)
## 10 dodge 8 auto(l4)
## # ... with 224 more rows
# We can select all columns from 2:6, we could also do that by name
mpg %>%
select(2:6)
## # A tibble: 234 x 5
## model displ year cyl trans
## <chr> <dbl> <int> <int> <chr>
## 1 k1500 tahoe 4wd 5.7 1999 8 auto(l4)
## 2 dakota pickup 4wd 5.2 1999 8 auto(l4)
## 3 durango 4wd 5.9 1999 8 auto(l4)
## 4 ram 1500 pickup 4wd 5.2 1999 8 auto(l4)
## 5 ram 1500 pickup 4wd 5.9 1999 8 auto(l4)
## 6 f150 pickup 4wd 5.4 1999 8 auto(l4)
## 7 range rover 4 1999 8 auto(l4)
## 8 range rover 4.6 1999 8 auto(l4)
## 9 land cruiser wagon 4wd 4.7 1999 8 auto(l4)
## 10 durango 4wd 5.2 1999 8 auto(l4)
## # ... with 224 more rows
You can also use some of the helper functions to select certain variables:
# Select variables that contain "y"
mpg %>%
select(contains('y'))
## # A tibble: 234 x 4
## year cyl cty hwy
## <int> <int> <int> <int>
## 1 1999 8 11 15
## 2 1999 8 11 15
## 3 1999 8 11 15
## 4 1999 8 11 15
## 5 1999 8 11 15
## 6 1999 8 11 15
## 7 1999 8 11 15
## 8 1999 8 11 15
## 9 1999 8 11 15
## 10 1999 8 11 16
## # ... with 224 more rows
# Select variables that start with "c"
mpg %>% select(starts_with('c'))
## # A tibble: 234 x 3
## cyl cty class
## <int> <int> <chr>
## 1 8 11 suv
## 2 8 11 pickup
## 3 8 11 suv
## 4 8 11 pickup
## 5 8 11 pickup
## 6 8 11 pickup
## 7 8 11 suv
## 8 8 11 suv
## 9 8 11 suv
## 10 8 11 suv
## # ... with 224 more rows
# Select can also be used to rearrange you data
mpg %>% select(class, displ, year, everything())
## # A tibble: 234 x 11
## class displ year manufacturer model cyl trans drv cty hwy fl
## <chr> <dbl> <int> <chr> <chr> <int> <chr> <chr> <int> <int> <chr>
## 1 suv 5.7 1999 chevrolet k1500 ta~ 8 auto~ 4 11 15 r
## 2 pickup 5.2 1999 dodge dakota p~ 8 auto~ 4 11 15 r
## 3 suv 5.9 1999 dodge durango ~ 8 auto~ 4 11 15 r
## 4 pickup 5.2 1999 dodge ram 1500~ 8 auto~ 4 11 15 r
## 5 pickup 5.9 1999 dodge ram 1500~ 8 auto~ 4 11 15 r
## 6 pickup 5.4 1999 ford f150 pic~ 8 auto~ 4 11 15 r
## 7 suv 4 1999 land rover range ro~ 8 auto~ 4 11 15 p
## 8 suv 4.6 1999 land rover range ro~ 8 auto~ 4 11 15 p
## 9 suv 4.7 1999 toyota land cru~ 8 auto~ 4 11 15 r
## 10 suv 5.2 1999 dodge durango ~ 8 auto~ 4 11 16 r
## # ... with 224 more rows
It is often necessary to create new variables, either from scratch or conditional on other variables. In the tidyverse package, mutate() will add new columns at the end of your data set. We will start by subsetting an example data set.
(sml_flights <- select(flights, year:day, ends_with("delay"), distance, air_time))
## # A tibble: 336,776 x 7
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227
## 2 2013 1 1 4 20 1416 227
## 3 2013 1 1 2 33 1089 160
## 4 2013 1 1 -1 -18 1576 183
## 5 2013 1 1 -6 -25 762 116
## 6 2013 1 1 -4 12 719 150
## 7 2013 1 1 -5 19 1065 158
## 8 2013 1 1 -3 -14 229 53
## 9 2013 1 1 -3 -8 944 140
## 10 2013 1 1 -2 8 733 138
## # ... with 336,766 more rows
# Now create new columns using mutate()
sml_flights %>%
mutate(gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
## year month day dep_delay arr_delay distance air_time gain hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 -9 3.78
## 2 2013 1 1 4 20 1416 227 -16 3.78
## 3 2013 1 1 2 33 1089 160 -31 2.67
## 4 2013 1 1 -1 -18 1576 183 17 3.05
## 5 2013 1 1 -6 -25 762 116 19 1.93
## 6 2013 1 1 -4 12 719 150 -16 2.5
## 7 2013 1 1 -5 19 1065 158 -24 2.63
## 8 2013 1 1 -3 -14 229 53 11 0.883
## 9 2013 1 1 -3 -8 944 140 5 2.33
## 10 2013 1 1 -2 8 733 138 -10 2.3
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
# If you only want to keep the new variable, use transmute()
sml_flights %>%
transmute(gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 -9 3.78 -2.38
## 2 -16 3.78 -4.23
## 3 -31 2.67 -11.6
## 4 17 3.05 5.57
## 5 19 1.93 9.83
## 6 -16 2.5 -6.4
## 7 -24 2.63 -9.11
## 8 11 0.883 12.5
## 9 5 2.33 2.14
## 10 -10 2.3 -4.35
## # ... with 336,766 more rows
You can use all of the familiar arithmetic operators (e.g., +, -, ^) with mutate. Modular arithmetic is a handy tool because it allows you to break integers up into pieces. %/% is integer division and %% is the remainder. For example, in the flights data set you can compute the hour and minute from the depart time:
flights %>%
transmute(dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # ... with 336,766 more rows
You can also determine offsets or running differences using lead() and lag().
x <- 1:10
lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
# Running differences
x - lag(x)
## [1] NA 1 1 1 1 1 1 1 1 1
# Find when values change
(x != lag(x))
## [1] NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# There are also cumulative and rolling aggregates
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
When you are creating a new variable based on certain conditions being met in other variables you can use case_when instead of stacking ifelse statements.
flights %>%
transmute(dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100,
time_category = case_when(
hour <= 8 ~ "early",
hour > 8 & hour <= 18 ~ "good",
hour > 18 ~ "late"
)) %>%
sample_n(20, replace = FALSE)
## # A tibble: 20 x 4
## dep_time hour minute time_category
## <int> <dbl> <dbl> <chr>
## 1 1031 10 31 good
## 2 1806 18 6 good
## 3 1330 13 30 good
## 4 NA NA NA <NA>
## 5 1117 11 17 good
## 6 2058 20 58 late
## 7 1654 16 54 good
## 8 1503 15 3 good
## 9 903 9 3 good
## 10 1843 18 43 good
## 11 1624 16 24 good
## 12 1226 12 26 good
## 13 1239 12 39 good
## 14 2040 20 40 late
## 15 1740 17 40 good
## 16 1603 16 3 good
## 17 605 6 5 early
## 18 1139 11 39 good
## 19 1027 10 27 good
## 20 1509 15 9 good
Calculating data summary statistics is also necessary. This is easily achieved with summarise() and also helper functions like group_by() or n(). summarise() will collapse data to a single row:
summarise(flights,
mean_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE),
min_delay = min(dep_delay, na.rm = TRUE),
max_delay = max(dep_delay, na.rm = TRUE),
distinct_delay = n_distinct(dep_delay),
nrow = n())
## # A tibble: 1 x 6
## mean_delay sd_delay min_delay max_delay distinct_delay nrow
## <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 12.6 40.2 -43 1301 528 336776
The utility of summarise() isn't fully realized until you pair it with group_by(). This changes the unit of analysis from the complete data set to the group. Then, you can apply functions to the group.
# Get average delay stats per day. Each unique group gets it's own row
flights %>%
group_by(year, month, day) %>%
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE),
min_delay = min(dep_delay, na.rm = TRUE),
max_delay = max(dep_delay, na.rm = TRUE),
distinct_delay = n_distinct(dep_delay),
nrow = n())
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 9
## # Groups: year, month [12]
## year month day mean_delay sd_delay min_delay max_delay distinct_delay
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 2013 1 1 11.5 45.3 -15 853 108
## 2 2013 1 2 13.9 37.2 -13 379 119
## 3 2013 1 3 11.0 31.5 -13 291 113
## 4 2013 1 4 8.95 27.7 -19 288 108
## 5 2013 1 5 5.73 25.7 -16 327 83
## 6 2013 1 6 7.15 23.2 -15 202 93
## 7 2013 1 7 5.42 27.1 -17 366 97
## 8 2013 1 8 2.55 19.3 -16 188 86
## 9 2013 1 9 2.28 46.9 -17 1301 75
## 10 2013 1 10 2.84 43.9 -17 1126 84
## # ... with 355 more rows, and 1 more variable: nrow <int>
# Group by day only
flights %>%
group_by(day) %>%
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE),
min_delay = min(dep_delay, na.rm = TRUE),
max_delay = max(dep_delay, na.rm = TRUE),
distinct_delay = n_distinct(dep_delay),
nrow = n())
## # A tibble: 31 x 7
## day mean_delay sd_delay min_delay max_delay distinct_delay nrow
## * <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 1 14.2 41.8 -23 853 286 11036
## 2 2 14.1 43.9 -24 696 301 10808
## 3 3 10.8 36.8 -33 878 266 11211
## 4 4 5.79 27.2 -20 545 222 11059
## 5 5 7.82 37.5 -24 896 281 10858
## 6 6 6.99 29.8 -22 589 225 11059
## 7 7 14.3 43.4 -43 653 298 10985
## 8 8 21.8 53.0 -21 520 333 11271
## 9 9 14.6 40.7 -26 1301 266 10857
## 10 10 18.3 51.8 -32 1126 322 11227
## # ... with 21 more rows
# Find the number of flights that were early each day
flights %>%
mutate(dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100,
time_category = case_when(
hour <= 8 ~ "early",
hour > 8 & hour <= 18 ~ "good",
hour > 18 ~ "late"
)) %>%
group_by(year, month, day) %>%
summarise(n_early = sum(time_category == "early", na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 169
## 2 2013 1 2 224
## 3 2013 1 3 214
## 4 2013 1 4 223
## 5 2013 1 5 183
## 6 2013 1 6 157
## 7 2013 1 7 222
## 8 2013 1 8 228
## 9 2013 1 9 226
## 10 2013 1 10 234
## # ... with 355 more rows
#OR
flights %>%
mutate(dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100,
time_category = case_when(
hour <= 8 ~ "early",
hour > 8 & hour <= 18 ~ "good",
hour > 18 ~ "late"
)) %>%
group_by(year, month, day, time_category) %>%
summarise(n_cat = n())
## `summarise()` has grouped output by 'year', 'month', 'day'. You can override using the `.groups` argument.
## # A tibble: 1,451 x 5
## # Groups: year, month, day [365]
## year month day time_category n_cat
## <int> <int> <int> <chr> <int>
## 1 2013 1 1 early 169
## 2 2013 1 1 good 526
## 3 2013 1 1 late 143
## 4 2013 1 1 <NA> 4
## 5 2013 1 2 early 224
## 6 2013 1 2 good 555
## 7 2013 1 2 late 156
## 8 2013 1 2 <NA> 8
## 9 2013 1 3 early 214
## 10 2013 1 3 good 537
## # ... with 1,441 more rows
When working with grouped data, you can use ungroup() to strip away all groups and apply functions to the data set as a whole and not by group. group_by() can be combined with all other tidyverse functions such as mutate() and filter().
# Worst delays for each group
sml_flights %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 7
## # Groups: year, month, day [365]
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 853 851 184 41
## 2 2013 1 1 290 338 1134 213
## 3 2013 1 1 260 263 266 46
## 4 2013 1 1 157 174 213 60
## 5 2013 1 1 216 222 708 121
## 6 2013 1 1 255 250 589 115
## 7 2013 1 1 285 246 1085 146
## 8 2013 1 1 192 191 199 44
## 9 2013 1 1 379 456 1092 222
## 10 2013 1 2 224 207 550 94
## # ... with 3,296 more rows
# Groups bigger than threashold
flights %>%
group_by(dest) %>%
filter(n() > 365) %>%
select(dest, everything())
## # A tibble: 332,577 x 19
## # Groups: dest [77]
## dest year month day dep_time sched_dep_time dep_delay arr_time
## <chr> <int> <int> <int> <int> <int> <dbl> <int>
## 1 IAH 2013 1 1 517 515 2 830
## 2 IAH 2013 1 1 533 529 4 850
## 3 MIA 2013 1 1 542 540 2 923
## 4 BQN 2013 1 1 544 545 -1 1004
## 5 ATL 2013 1 1 554 600 -6 812
## 6 ORD 2013 1 1 554 558 -4 740
## 7 FLL 2013 1 1 555 600 -5 913
## 8 IAD 2013 1 1 557 600 -3 709
## 9 MCO 2013 1 1 557 600 -3 838
## 10 ORD 2013 1 1 558 600 -2 753
## # ... with 332,567 more rows, and 11 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# Then standardize to compare group stats
flights %>%
group_by(dest) %>%
filter(n() > 365) %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups: dest [77]
## year month day dest arr_delay prop_delay
## <int> <int> <int> <chr> <dbl> <dbl>
## 1 2013 1 1 IAH 11 0.000111
## 2 2013 1 1 IAH 20 0.000201
## 3 2013 1 1 MIA 33 0.000235
## 4 2013 1 1 ORD 12 0.0000424
## 5 2013 1 1 FLL 19 0.0000938
## 6 2013 1 1 ORD 8 0.0000283
## 7 2013 1 1 LAX 7 0.0000344
## 8 2013 1 1 DFW 31 0.000282
## 9 2013 1 1 ATL 12 0.0000400
## 10 2013 1 1 DTW 16 0.000116
## # ... with 131,096 more rows
Other ways to create new variables is by separating or uniting existing variables. separate() pulls apart one column into multiple columns by splitting based on a separator character or regex.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
# Now split the rate into cases and population size
table3 %>%
separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# Notice that cases and population are character columns. We can include convert=TRUE
table3 %>%
separate(col = rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# Separate by position
table3 %>%
separate(year, into = c("century", "decade"), sep = 2, convert = TRUE)
## # A tibble: 6 x 4
## country century decade rate
## <chr> <int> <int> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 0 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 0 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 0 213766/1280428583
By default it will split at any non-alphanumeric character. But you can also use any regular expression by including the sep = argument.
Now for a more complicated example. Lets create a super messy data set, and clean it up.
messy_data <- data.frame(
state = sample(c("TN", "KY", "GA", "AL"), size = 40, replace = TRUE,
prob = c(0.3, 0.2, 0.25, 0.25)),
ID = 1:40,
w1 = round(runif(40, 10, 900), 0),
w2 = round(runif(40, 10, 900), 0),
w3 = sample(c(round(runif(1, 10, 900), 0), NA), size = 40, replace = TRUE,
prob = c(0.8, 0.2)),
w4 = sample(c(round(runif(1, 10, 900), 0), NA), size = 40, replace = TRUE,
prob = c(0.4, 0.6))) %>%
unite(wgt, starts_with("w"), sep = ",", na.rm = TRUE) %>%
unite(dat, state, ID, wgt, sep = "_")
messy_data %>%
head()
## dat
## 1 GA_1_229,171,387,549
## 2 TN_2_60,878,387,549
## 3 TN_3_599,99
## 4 GA_4_605,125,387
## 5 GA_5_76,663,387
## 6 GA_6_318,216,387
# Now separte our state, ID, and weights
messy_data %>%
separate(dat, into = c("State", "ID", "Wgts"), sep = "_") %>%
head()
## State ID Wgts
## 1 GA 1 229,171,387,549
## 2 TN 2 60,878,387,549
## 3 TN 3 599,99
## 4 GA 4 605,125,387
## 5 GA 5 76,663,387
## 6 GA 6 318,216,387
# Now we are going str_split() to split our strings, it will create a list of our splits
messy_data %>%
separate(dat, into = c("State", "ID", "Wgts"), sep = "_") %>%
mutate(wgt_list = str_split(Wgts, pattern = ",")) %>%
select(-Wgts) %>%
head()
## State ID wgt_list
## 1 GA 1 229, 171, 387, 549
## 2 TN 2 60, 878, 387, 549
## 3 TN 3 599, 99
## 4 GA 4 605, 125, 387
## 5 GA 5 76, 663, 387
## 6 GA 6 318, 216, 387
# We will unnest() the list column to create a long-form table
messy_data %>%
separate(dat, into = c("State", "ID", "Wgts"), sep = "_") %>%
mutate(wgt_list = str_split(Wgts, pattern = ",")) %>%
select(-Wgts) %>%
unnest(col = wgt_list)
## # A tibble: 127 x 3
## State ID wgt_list
## <chr> <chr> <chr>
## 1 GA 1 229
## 2 GA 1 171
## 3 GA 1 387
## 4 GA 1 549
## 5 TN 2 60
## 6 TN 2 878
## 7 TN 2 387
## 8 TN 2 549
## 9 TN 3 599
## 10 TN 3 99
## # ... with 117 more rows
The opposite of separate() is unite(), which combines multiple columns into a single column. It is most useful if you want to combine multiple fields (like capture year, capture number, and sex) to create unique and informative ID numbers.
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
# The default unites with the _ as a separator
table5 %>%
unite(new, century, year)
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
# But you can change that to whatever you want
table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Most analyses will involve multiple tables or excel sheets of data. Maybe you have one file with a list of ID numbers and individual characteristics like sex, age, capture date, etc. You might have a separate file with all the recaptures for every individual that doesn't include all the individual characteristics. These multiple excel tables are relational data and can be joined to combine the information contained in them.
We will dive into joins using the nycflights13 data that contains 4 tibbles
head(airlines)
## # A tibble: 6 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
head(airports)
## # A tibble: 6 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Y~
## 2 06A Moton Field Municipal Airp~ 32.5 -85.7 264 -6 A America/Chica~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chica~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Y~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Y~
## 6 0A9 Elizabethton Municipal Air~ 36.4 -82.2 1593 -5 A America/New_Y~
head(planes)
## # A tibble: 6 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing mu~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 2 N102UW 1998 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 3 N103US 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 4 N104UW 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 5 N10575 2002 Fixed wing mu~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 6 N105UW 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
head(weather)
## # A tibble: 6 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
Mutating joins combine variables from two tables. It first matches observations by their keys (unique identifiers) then copies across variables from one table to another.
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
head(flights2)
## # A tibble: 6 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
# Add the full airline number
flights2 %>%
select(-origin, - dest) %>%
left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
There are different types of mutating joins. First is the inner join that matches pairs of observations whenever the keys are equal and keeps observations that appear in both tables.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x
## # A tibble: 3 x 2
## key val_x
## <dbl> <chr>
## 1 1 x1
## 2 2 x2
## 3 3 x3
y
## # A tibble: 3 x 2
## key val_y
## <dbl> <chr>
## 1 1 y1
## 2 2 y2
## 3 4 y3
# Inner join
x %>%
inner_join(y, by = "key")
## # A tibble: 2 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
Outer joins keep observations that appear in at least one of the tables. A left join keeps all observations in x, a right join keeps all observations in y, and a full join keeps all observations of x and y.
# Left join
x %>% left_join(y, by = "key")
## # A tibble: 3 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 3 x3 <NA>
# Right join
x %>% right_join(y, by = "key")
## # A tibble: 3 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 4 <NA> y3
# Full join
x %>% full_join(y, by = "key")
## # A tibble: 4 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
## 3 3 x3 <NA>
## 4 4 <NA> y3
If you don't define the key (i.e., the by argument) then it uses all column names that match to generate the key.
# Joining by year, month, day, hour, and origin
flights2 %>%
left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
If the key doesn't have matching column names, you can used a named character vector, e.g., by = c("a" = "b") which will match x$a to y$b.
head(flights2)
## # A tibble: 6 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
# Get lat and lon information for the airport based on destination airport
flights2 %>%
left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Geor~ 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Geor~ 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miam~ 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Hart~ 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chic~ 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort~ 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Wash~ 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orla~ 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chic~ 42.0 -87.9 668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## # tzone <chr>
# Now based on origin airport
flights2 %>%
left_join(airports, c("origin" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Newa~ 40.7 -74.2 18
## 2 2013 1 1 5 LGA IAH N24211 UA La G~ 40.8 -73.9 22
## 3 2013 1 1 5 JFK MIA N619AA AA John~ 40.6 -73.8 13
## 4 2013 1 1 5 JFK BQN N804JB B6 John~ 40.6 -73.8 13
## 5 2013 1 1 6 LGA ATL N668DN DL La G~ 40.8 -73.9 22
## 6 2013 1 1 5 EWR ORD N39463 UA Newa~ 40.7 -74.2 18
## 7 2013 1 1 6 EWR FLL N516JB B6 Newa~ 40.7 -74.2 18
## 8 2013 1 1 6 LGA IAD N829AS EV La G~ 40.8 -73.9 22
## 9 2013 1 1 6 JFK MCO N593JB B6 John~ 40.6 -73.8 13
## 10 2013 1 1 6 LGA ORD N3ALAA AA La G~ 40.8 -73.9 22
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## # tzone <chr>
Filtering joins match observations similar to mutating joins, but affect the observations, not the variables. A semi_join(x, y) keeps all observations in x that have a match in y. An anti_join(x, y) drops all observations in x that have a match in y.
# Top destinations
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
## # A tibble: 10 x 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
# Now find each flight that went to one of the top destinations
flights %>%
semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 554 600 -6 812 837
## 3 2013 1 1 554 558 -4 740 728
## 4 2013 1 1 555 600 -5 913 854
## 5 2013 1 1 557 600 -3 838 846
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 559 0 702 706
## 10 2013 1 1 600 600 0 851 858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# We can find out how many flights don't have a match in planes
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # ... with 712 more rows