Introduction

Course notes from the Joining Data in R with dplyr course on DataCamp

Whats Covered

Mutating joins
Filtering joins and set operations
Assembling data
Advanced joining
Case Study

Aditional Resources

Mutating Joins

Welcome to the course!

Advantages to dplyr over base R merge function for joing data
- dplyr always preserves the row order
- dplyr has much more intuitive syntax
- dplyr can be applied to databases, or spark
dplyr is a front end language for maniulating data that can be converted to multiple backends like SQL or spark.
- This is pretty useful. Everything we cover in this class could work on a spark dataframe or a sql table

Keys

The Primary key needs to be unique in a table
The foreign key in the second table can be duplicated
second table will be matched to the primary table based on the primary key
The primary key may be one, two or even more columns in the table

Primary Keys

What is the primary key in artists dataset?
- name

head(artists0, 10)

## # A tibble: 10 x 2
##               name instrument
##              <chr>      <chr>
##  1   Jimmy Buffett     Guitar
##  2 George Harrison     Guitar
##  3     Mick Jagger     Vocals
##  4       Tom Jones     Vocals
##  5      Davy Jones     Vocals
##  6     John Lennon     Guitar
##  7  Paul McCartney       Bass
##  8      Jimmy Page     Guitar
##  9       Joe Perry     Guitar
## 10   Elvis Presley     Vocals

Secondary Keys

Which datasets have a primary key that matches artists$name?
- bands$name
- songs$writer

head(albums0)

## # A tibble: 6 x 3
##                        name               band  year
##                       <chr>              <chr> <int>
## 1        A Hard Day's Night        The Beatles  1964
## 2      Magical Mystery Tour        The Beatles  1967
## 3          Beggar's Banquet The Rolling Stones  1968
## 4                Abbey Road        The Beatles  1969
## 5           Led Zeppelin IV       Led Zeppelin  1971
## 6 The Dark Side of the Moon         Pink Floyd  1973

head(bands0)

## # A tibble: 6 x 2
##              name         band
##             <chr>        <chr>
## 1     John Bonham Led Zeppelin
## 2 John Paul Jones Led Zeppelin
## 3      Jimmy Page Led Zeppelin
## 4    Robert Plant Led Zeppelin
## 5 George Harrison  The Beatles
## 6     John Lennon  The Beatles

head(songs0)

## # A tibble: 4 x 3
##               song                album         writer
##              <chr>                <chr>          <chr>
## 1    Come Together           Abbey Road    John Lennon
## 2         Dream On            Aerosmith   Steven Tyler
## 3   Hello, Goodbye Magical Mystery Tour Paul McCartney
## 4 It's Not Unusual     Along Came Jones      Tom Jones

Multi-variable keys

What is the primary key in artists?
- The combination of first and last

head(artists)

## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar

Joins

Joins in dplyr will work on data frames, tibbles (tbl_df), and tbl references

A basic join

head(artists)

## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar

head(bands)

## # A tibble: 6 x 3
##       first     last         band
##       <chr>    <chr>        <chr>
## 1      John   Bonham Led Zeppelin
## 2 John Paul    Jones Led Zeppelin
## 3     Jimmy     Page Led Zeppelin
## 4    Robert    Plant Led Zeppelin
## 5    George Harrison  The Beatles
## 6      John   Lennon  The Beatles

# Complete the code to join artists to bands
bands2 <- left_join(bands, artists, by = c('first','last'))

# Examine the results
bands2

## # A tibble: 13 x 4
##        first      last               band instrument
##        <chr>     <chr>              <chr>      <chr>
##  1      John    Bonham       Led Zeppelin       <NA>
##  2 John Paul     Jones       Led Zeppelin       <NA>
##  3     Jimmy      Page       Led Zeppelin     Guitar
##  4    Robert     Plant       Led Zeppelin       <NA>
##  5    George  Harrison        The Beatles     Guitar
##  6      John    Lennon        The Beatles     Guitar
##  7      Paul McCartney        The Beatles       Bass
##  8     Ringo     Starr        The Beatles      Drums
##  9     Jimmy   Buffett  The Coral Reefers     Guitar
## 10      Mick    Jagger The Rolling Stones     Vocals
## 11     Keith  Richards The Rolling Stones     Guitar
## 12   Charlie     Watts The Rolling Stones       <NA>
## 13    Ronnie      Wood The Rolling Stones       <NA>

A second join

Notice how the last column is pulled in as last.x and last.y if not included in the join
- And how all combination of rows are created for each name that is duplicated, e.g. john

left_join(bands, artists, by = "first")

## # A tibble: 16 x 5
##        first    last.x               band    last.y instrument
##        <chr>     <chr>              <chr>     <chr>      <chr>
##  1      John    Bonham       Led Zeppelin    Lennon     Guitar
##  2 John Paul     Jones       Led Zeppelin      <NA>       <NA>
##  3     Jimmy      Page       Led Zeppelin   Buffett     Guitar
##  4     Jimmy      Page       Led Zeppelin      Page     Guitar
##  5    Robert     Plant       Led Zeppelin      <NA>       <NA>
##  6    George  Harrison        The Beatles  Harrison     Guitar
##  7      John    Lennon        The Beatles    Lennon     Guitar
##  8      Paul McCartney        The Beatles McCartney       Bass
##  9      Paul McCartney        The Beatles     Simon     Guitar
## 10     Ringo     Starr        The Beatles     Starr      Drums
## 11     Jimmy   Buffett  The Coral Reefers   Buffett     Guitar
## 12     Jimmy   Buffett  The Coral Reefers      Page     Guitar
## 13      Mick    Jagger The Rolling Stones    Jagger     Vocals
## 14     Keith  Richards The Rolling Stones  Richards     Guitar
## 15   Charlie     Watts The Rolling Stones      <NA>       <NA>
## 16    Ronnie      Wood The Rolling Stones      <NA>       <NA>

# Fix the code to recreate bands3
left_join(bands, artists, by = c("first","last"))

## # A tibble: 13 x 4
##        first      last               band instrument
##        <chr>     <chr>              <chr>      <chr>
##  1      John    Bonham       Led Zeppelin       <NA>
##  2 John Paul     Jones       Led Zeppelin       <NA>
##  3     Jimmy      Page       Led Zeppelin     Guitar
##  4    Robert     Plant       Led Zeppelin       <NA>
##  5    George  Harrison        The Beatles     Guitar
##  6      John    Lennon        The Beatles     Guitar
##  7      Paul McCartney        The Beatles       Bass
##  8     Ringo     Starr        The Beatles      Drums
##  9     Jimmy   Buffett  The Coral Reefers     Guitar
## 10      Mick    Jagger The Rolling Stones     Vocals
## 11     Keith  Richards The Rolling Stones     Guitar
## 12   Charlie     Watts The Rolling Stones       <NA>
## 13    Ronnie      Wood The Rolling Stones       <NA>

A right join

# Finish the code below to recreate bands3 with a right join
bands2 <- left_join(bands, artists, by = c("first", "last"))

bands3 <- right_join(artists, bands, by = c("first", "last"))

# Check that bands2 is equal to bands3
head(bands2)

## # A tibble: 6 x 4
##       first     last         band instrument
##       <chr>    <chr>        <chr>      <chr>
## 1      John   Bonham Led Zeppelin       <NA>
## 2 John Paul    Jones Led Zeppelin       <NA>
## 3     Jimmy     Page Led Zeppelin     Guitar
## 4    Robert    Plant Led Zeppelin       <NA>
## 5    George Harrison  The Beatles     Guitar
## 6      John   Lennon  The Beatles     Guitar

head(bands3)

## # A tibble: 6 x 4
##       first     last instrument         band
##       <chr>    <chr>      <chr>        <chr>
## 1      John   Bonham       <NA> Led Zeppelin
## 2 John Paul    Jones       <NA> Led Zeppelin
## 3     Jimmy     Page     Guitar Led Zeppelin
## 4    Robert    Plant       <NA> Led Zeppelin
## 5    George Harrison     Guitar  The Beatles
## 6      John   Lennon     Guitar  The Beatles

setequal(bands2, bands3)

## TRUE

Variations on joins

left_join - prioritizes left dataset
right_join - prioritizes right dataset
inner_join - only retains rows in both datasets
full_join - retains all rows
Use %>% (pipes) to string together these joins

inner joins and full joins

head(songs)

## # A tibble: 4 x 4
##               song                album  first      last
##              <chr>                <chr>  <chr>     <chr>
## 1    Come Together           Abbey Road   John    Lennon
## 2         Dream On            Aerosmith Steven     Tyler
## 3   Hello, Goodbye Magical Mystery Tour   Paul McCartney
## 4 It's Not Unusual     Along Came Jones    Tom     Jones

head(albums)

## # A tibble: 6 x 3
##                       album               band  year
##                       <chr>              <chr> <int>
## 1        A Hard Day's Night        The Beatles  1964
## 2      Magical Mystery Tour        The Beatles  1967
## 3          Beggar's Banquet The Rolling Stones  1968
## 4                Abbey Road        The Beatles  1969
## 5           Led Zeppelin IV       Led Zeppelin  1971
## 6 The Dark Side of the Moon         Pink Floyd  1973

inner_join(songs, albums, by = "album")

## # A tibble: 3 x 6
##             song                album  first      last        band  year
##            <chr>                <chr>  <chr>     <chr>       <chr> <int>
## 1  Come Together           Abbey Road   John    Lennon The Beatles  1969
## 2       Dream On            Aerosmith Steven     Tyler   Aerosmith  1973
## 3 Hello, Goodbye Magical Mystery Tour   Paul McCartney The Beatles  1967

# Join bands to artists using full_join()
head(artists)

## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar

head(bands)

## # A tibble: 6 x 3
##       first     last         band
##       <chr>    <chr>        <chr>
## 1      John   Bonham Led Zeppelin
## 2 John Paul    Jones Led Zeppelin
## 3     Jimmy     Page Led Zeppelin
## 4    Robert    Plant Led Zeppelin
## 5    George Harrison  The Beatles
## 6      John   Lennon  The Beatles

full_join(artists, bands, by = c("first","last"))

## # A tibble: 21 x 4
##     first      last instrument               band
##     <chr>     <chr>      <chr>              <chr>
##  1  Jimmy   Buffett     Guitar  The Coral Reefers
##  2 George  Harrison     Guitar        The Beatles
##  3   Mick    Jagger     Vocals The Rolling Stones
##  4    Tom     Jones     Vocals               <NA>
##  5   Davy     Jones     Vocals               <NA>
##  6   John    Lennon     Guitar        The Beatles
##  7   Paul McCartney       Bass        The Beatles
##  8  Jimmy      Page     Guitar       Led Zeppelin
##  9    Joe     Perry     Guitar               <NA>
## 10  Elvis   Presley     Vocals               <NA>
## # ... with 11 more rows

Pipes

# Find guitarists in bands dataset (don't change)
temp <- left_join(bands, artists, by = c("first", "last"))
temp <- filter(temp, instrument == "Guitar")
select(temp, first, last, band)

## # A tibble: 5 x 3
##    first     last               band
##    <chr>    <chr>              <chr>
## 1  Jimmy     Page       Led Zeppelin
## 2 George Harrison        The Beatles
## 3   John   Lennon        The Beatles
## 4  Jimmy  Buffett  The Coral Reefers
## 5  Keith Richards The Rolling Stones

# Reproduce code above using pipes
bands %>% 
  left_join(artists, by = c("first","last")) %>%
  filter(instrument == "Guitar") %>%
  select(first, last, band)

## # A tibble: 5 x 3
##    first     last               band
##    <chr>    <chr>              <chr>
## 1  Jimmy     Page       Led Zeppelin
## 2 George Harrison        The Beatles
## 3   John   Lennon        The Beatles
## 4  Jimmy  Buffett  The Coral Reefers
## 5  Keith Richards The Rolling Stones

Practice with pipes and joins

# Examine the contents of the goal dataset
goal

## # A tibble: 3 x 6
##   first      last instrument        band             song
##   <chr>     <chr>      <chr>       <chr>            <chr>
## 1   Tom     Jones     Vocals        <NA> It's Not Unusual
## 2  John    Lennon     Guitar The Beatles    Come Together
## 3  Paul McCartney       Bass The Beatles   Hello, Goodbye
## # ... with 1 more variables: album <chr>

# Create goal2 using full_join() and inner_join() 
head(artists)

## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar

head(bands)

## # A tibble: 6 x 3
##       first     last         band
##       <chr>    <chr>        <chr>
## 1      John   Bonham Led Zeppelin
## 2 John Paul    Jones Led Zeppelin
## 3     Jimmy     Page Led Zeppelin
## 4    Robert    Plant Led Zeppelin
## 5    George Harrison  The Beatles
## 6      John   Lennon  The Beatles

head(songs)

## # A tibble: 4 x 4
##               song                album  first      last
##              <chr>                <chr>  <chr>     <chr>
## 1    Come Together           Abbey Road   John    Lennon
## 2         Dream On            Aerosmith Steven     Tyler
## 3   Hello, Goodbye Magical Mystery Tour   Paul McCartney
## 4 It's Not Unusual     Along Came Jones    Tom     Jones

goal2 <- artists %>%
  full_join(bands, by = c("first","last")) %>%
  inner_join(songs, by = c("first","last"))
  
# Check that goal and goal2 are the same
setequal(goal, goal2)

## TRUE

Choose your joins

# Create one table that combines all information
collection <- artists %>%
  full_join(bands, by = c("first","last")) %>%
  full_join(songs, by = c("first","last")) %>%
  full_join(albums, by = c("album", "band"))
  
collection

## # A tibble: 29 x 7
##     first      last instrument               band             song
##     <chr>     <chr>      <chr>              <chr>            <chr>
##  1  Jimmy   Buffett     Guitar  The Coral Reefers             <NA>
##  2 George  Harrison     Guitar        The Beatles             <NA>
##  3   Mick    Jagger     Vocals The Rolling Stones             <NA>
##  4    Tom     Jones     Vocals               <NA> It's Not Unusual
##  5   Davy     Jones     Vocals               <NA>             <NA>
##  6   John    Lennon     Guitar        The Beatles    Come Together
##  7   Paul McCartney       Bass        The Beatles   Hello, Goodbye
##  8  Jimmy      Page     Guitar       Led Zeppelin             <NA>
##  9    Joe     Perry     Guitar               <NA>             <NA>
## 10  Elvis   Presley     Vocals               <NA>             <NA>
## # ... with 19 more rows, and 2 more variables: album <chr>, year <int>

Filtering joins and set operations

Filtering joins return a copy of the dataset that has been filtered, not augmented (as with mutating joins)

Semi-joins

Apply a semi-join

# View the output of semi_join()
artists %>% 
  semi_join(songs, by = c("first", "last"))

## # A tibble: 3 x 3
##   first      last instrument
##   <chr>     <chr>      <chr>
## 1  John    Lennon     Guitar
## 2  Paul McCartney       Bass
## 3   Tom     Jones     Vocals

# Create the same result
artists %>% 
  right_join(songs, by = c("first", "last")) %>% 
  filter(!is.na(instrument)) %>% 
  select(first, last, instrument)

## # A tibble: 3 x 3
##   first      last instrument
##   <chr>     <chr>      <chr>
## 1  John    Lennon     Guitar
## 2  Paul McCartney       Bass
## 3   Tom     Jones     Vocals

Exploring with semi-joins

albums %>% 
  # Collect the albums made by a band
  semi_join(bands, by = "band") %>% 
  # Count the albums made by a band
  nrow()

## [1] 5

A more precise way to filter?

You can use semi_joins with a df of date you want to match to easily filter

tracks

## # A tibble: 21 x 6
##                               track                band      label  year
##                               <chr>               <chr>      <chr> <int>
##  1                Can't Buy Me Love         The Beatles Parlophone  1964
##  2                      I Feel Fine         The Beatles Parlophone  1964
##  3               A Hard Day's Night         The Beatles Parlophone  1964
##  4                 Sound of Silence Simon and Garfunkel   Columbia  1964
##  5                            Help!         The Beatles Parlophone  1965
##  6                   Ticket to Ride         The Beatles Parlophone  1965
##  7                      I am a Rock Simon and Garfunkel   Columbia  1965
##  8 Yellow Submarine / Eleanor Rigby         The Beatles Parlophone  1966
##  9                   Homeward Bound Simon and Garfunkel   Columbia  1966
## 10                 Scarborough Fair Simon and Garfunkel   Columbia  1966
## # ... with 11 more rows, and 2 more variables: first <chr>, last <chr>

matches

## # A tibble: 3 x 3
##                  band  year first
##                 <chr> <int> <chr>
## 1         The Beatles  1964  Paul
## 2         The Beatles  1965  John
## 3 Simon and Garfunkel  1966  Paul

# With semi-join
tracks %>% 
  semi_join(matches,by = c("band", "year", "first"))

## # A tibble: 4 x 6
##               track                band      label  year first      last
##               <chr>               <chr>      <chr> <int> <chr>     <chr>
## 1 Can't Buy Me Love         The Beatles Parlophone  1964  Paul McCartney
## 2             Help!         The Beatles Parlophone  1965  John    Lennon
## 3    Ticket to Ride         The Beatles Parlophone  1965  John    Lennon
## 4    Homeward Bound Simon and Garfunkel   Columbia  1966  Paul     Simon

# With dply filter statement
tracks %>% 
  filter(
    (band == "The Beatles" & year == 1964 & first == "Paul") |
    (band == "The Beatles" & year == 1965 & first == "John") |
    (band == "Simon and Garfunkel" & year == 1966 & first == "Paul")
  )

## # A tibble: 4 x 6
##               track                band      label  year first      last
##               <chr>               <chr>      <chr> <int> <chr>     <chr>
## 1 Can't Buy Me Love         The Beatles Parlophone  1964  Paul McCartney
## 2             Help!         The Beatles Parlophone  1965  John    Lennon
## 3    Ticket to Ride         The Beatles Parlophone  1965  John    Lennon
## 4    Homeward Bound Simon and Garfunkel   Columbia  1966  Paul     Simon

Anti-joins

Apply an anti-join

# Return rows of artists that don't have bands info
artists %>% 
  anti_join(bands, by = c("first","last"))

## # A tibble: 8 x 3
##   first    last instrument
##   <chr>   <chr>      <chr>
## 1 Nancy  Wilson     Vocals
## 2 Brian  Wilson     Vocals
## 3   Joe   Walsh     Guitar
## 4  Paul   Simon     Guitar
## 5 Elvis Presley     Vocals
## 6   Joe   Perry     Guitar
## 7  Davy   Jones     Vocals
## 8   Tom   Jones     Vocals

Apply another anti-join

# Check whether album names in labels are mis-entered
glimpse(labels)

## Observations: 9
## Variables: 2
## $ album <chr> "Abbey Road", "A Hard Days Night", "Magical Mystery Tour...
## $ label <chr> "Apple", "Parlophone", "Parlophone", "Atlantic", "Harves...

glimpse(albums)

## Observations: 9
## Variables: 3
## $ album <chr> "A Hard Day's Night", "Magical Mystery Tour", "Beggar's ...
## $ band  <chr> "The Beatles", "The Beatles", "The Rolling Stones", "The...
## $ year  <int> 1964, 1967, 1968, 1969, 1971, 1973, 1973, 1977, 1982

labels %>% 
  anti_join(albums, by = c("album"))

## # A tibble: 1 x 2
##               album      label
##               <chr>      <chr>
## 1 A Hard Days Night Parlophone

Which filtering join?

# Determine which key joins labels and songs
labels

## # A tibble: 9 x 2
##                       album           label
##                       <chr>           <chr>
## 1                Abbey Road           Apple
## 2         A Hard Days Night      Parlophone
## 3      Magical Mystery Tour      Parlophone
## 4           Led Zeppelin IV        Atlantic
## 5 The Dark Side of the Moon         Harvest
## 6          Hotel California          Asylum
## 7                   Rumours Warner Brothers
## 8                 Aerosmith        Columbia
## 9          Beggar's Banquet           Decca

songs

## # A tibble: 4 x 4
##               song                album  first      last
##              <chr>                <chr>  <chr>     <chr>
## 1    Come Together           Abbey Road   John    Lennon
## 2         Dream On            Aerosmith Steven     Tyler
## 3   Hello, Goodbye Magical Mystery Tour   Paul McCartney
## 4 It's Not Unusual     Along Came Jones    Tom     Jones

# Check your understanding
songs %>% 
  # Find the rows of songs that match a row in labels
  semi_join(labels, by = c("album")) %>% 
  # Number of matches between labels and songs
  nrow()

## [1] 3

Set operations

union will return every row that appears in one or more of the datasets
- If a row appears multiple times union will only return it once
interesect will return only the rows that appear in both datasets
setdiff will return the rows that appear in the first dataset but not the second

How many songs are there?

How many unique songs do these two albums contain in total?

head(aerosmith)

## # A tibble: 6 x 2
##                song     length
##               <chr>     <time>
## 1           Make It 13260 secs
## 2          Somebody 13500 secs
## 3          Dream On 16080 secs
## 4    One Way Street 25200 secs
## 5          Mama Kin 15900 secs
## 6 Write me a Letter 15060 secs

head(greatest_hits)

## # A tibble: 6 x 2
##                      song     length
##                     <chr>     <time>
## 1                Dream On 16080 secs
## 2                Mama Kin 16020 secs
## 3 Same Old Song and Dance 11040 secs
## 4       Seasons of Winter 17820 secs
## 5           Sweet Emotion 11700 secs
## 6           Walk this Way 12780 secs

aerosmith %>% 
  # Create the new dataset using a set operation
  union(greatest_hits) %>% 
  # Count the total number of songs
  nrow()

## [1] 24

Greatest hits

Which songs from Aerosmith made it onto Greates Hits?

# Create the new dataset using a set operation
aerosmith %>% 
  intersect(greatest_hits)

## # A tibble: 1 x 2
##       song     length
##      <chr>     <time>
## 1 Dream On 16080 secs

Live! Bootleg songs

Which songs are on Live! Bootleg but not on Greatest Hits? Notice that the length of songs may be different when they are performed live.

# Select the song names from live
live_songs <- live %>% 
  select(song)

# Select the song names from greatest_hits
greatest_songs <- greatest_hits %>% 
  select(song)

# Create the new dataset using a set operation
live_songs %>% 
  setdiff(greatest_songs)

## # A tibble: 8 x 1
##                                          song
##                                         <chr>
## 1                          Lord of the Thighs
## 2                           Toys in the Attic
## 3                               Sick as a Dog
## 4                         Sight for Sore Eyes
## 5                            S.O.S. (Too Bad)
## 6                             I Ain't Got You
## 7                Mother Popcorn/Draw the Line
## 8 Train Kept A-Rollin'/Strangers in the Night

Multiple operations

Which songs appear on one of Live! Bootleg or Greatest Hits, but not both?

# Select songs from live and greatest_hits
live_songs <- live %>% select(song)
greatest_songs <- greatest_hits %>% select(song)

# Return the songs that only exist in one dataset
all_songs <- union(live_songs, greatest_songs)
both_songs <- intersect(live_songs, greatest_songs)
one_songs <- setdiff(all_songs, both_songs)
one_songs

## # A tibble: 17 x 1
##                                           song
##                                          <chr>
##  1                       One Way Street (live)
##  2                       Sweet Emotion (remix)
##  3                           Lightning Strikes
##  4              Remember (Walking in the Sand)
##  5                            Kings and Queens
##  6                               Draw the Line
##  7                         Big Ten Inch Record
##  8                           Seasons of Winter
##  9                     Same Old Song and Dance
## 10 Train Kept A-Rollin'/Strangers in the Night
## 11                Mother Popcorn/Draw the Line
## 12                             I Ain't Got You
## 13                            S.O.S. (Too Bad)
## 14                         Sight for Sore Eyes
## 15                               Sick as a Dog
## 16                           Toys in the Attic
## 17                          Lord of the Thighs

Comparing datasets

Its common to want to know if one data set is the same as another dataset
dplyr’s setequal will do this easily
base R’s identical is will only return true if the datasets have the exact same rows in the exact same order

Recap:

Mutating Joins:
- left_join
- right_join
- inner_join
- full_join
Filtering Joins:
- semi_join
- anti_join
Set Operations:
- union
- intersect
- setdiff
Comparisions
- setequal

Apply setequal

Both anthologies claim to contain the complete studio recordings of Led Zeppelin, but do the anthologies contain the same exact songs?

# Check if same order: definitive and complete
identical(definitive, complete)

## [1] FALSE

# Check if any order: definitive and complete
setequal(definitive, complete)

## FALSE: Different number of rows

# Songs in definitive but not complete
setdiff(definitive, complete)

## # A tibble: 15 x 2
##                           song                     album
##                          <chr>                     <chr>
##  1               Rock and Roll The Song Remains the Same
##  2             Celebration Day The Song Remains the Same
##  3                   Black Dog The Song Remains the Same
##  4 Over the Hills and Far Away The Song Remains the Same
##  5          Misty Mountain Hop The Song Remains the Same
##  6  Since I've Been Loving You The Song Remains the Same
##  7                  No Quarter The Song Remains the Same
##  8   The Song Remains the Same The Song Remains the Same
##  9               The Rain Song The Song Remains the Same
## 10                   The Ocean The Song Remains the Same
## 11          Dazed and Confused The Song Remains the Same
## 12          Stairway to Heaven The Song Remains the Same
## 13                   Moby Dick The Song Remains the Same
## 14                Heartbreaker The Song Remains the Same
## 15            Whole Lotta Love The Song Remains the Same

# Songs in complete but not definitive
setdiff(complete, definitive)

## # A tibble: 0 x 2
## # ... with 2 variables: song <chr>, album <chr>

Apply setequal again

If the datasets have the same columns and all are used in the key, then intersect is analagous to semi_join
In the same scenario setdiff is similar to anti_join

# Return songs in definitive that are not in complete
definitive %>% 
  anti_join(complete, by = c("song", "album"))

## # A tibble: 15 x 2
##                           song                     album
##                          <chr>                     <chr>
##  1            Whole Lotta Love The Song Remains the Same
##  2                Heartbreaker The Song Remains the Same
##  3                   Moby Dick The Song Remains the Same
##  4          Stairway to Heaven The Song Remains the Same
##  5          Dazed and Confused The Song Remains the Same
##  6                   The Ocean The Song Remains the Same
##  7               The Rain Song The Song Remains the Same
##  8   The Song Remains the Same The Song Remains the Same
##  9                  No Quarter The Song Remains the Same
## 10  Since I've Been Loving You The Song Remains the Same
## 11          Misty Mountain Hop The Song Remains the Same
## 12 Over the Hills and Far Away The Song Remains the Same
## 13                   Black Dog The Song Remains the Same
## 14             Celebration Day The Song Remains the Same
## 15               Rock and Roll The Song Remains the Same

# Return songs in complete that are not in definitive
complete %>% 
  anti_join(definitive, by = c("song", "album"))

## # A tibble: 0 x 2
## # ... with 2 variables: song <chr>, album <chr>

Comparing albums

Is the soundtrack of The Song Remains the Same the only difference between The Definitive Collection and The Complete Studio Recordings?

nrow(complete)

## [1] 85

nrow(definitive)

## [1] 100

# Check if same order: definitive and union of complete and soundtrack
union(complete, soundtrack) %>%
  identical(definitive)

## [1] FALSE

# Check if any order: definitive and union of complete and soundtrack
union(complete, soundtrack) %>%
  setequal(definitive)

## TRUE

Assembling data

Binds

Base R binds
- rbind, cbind
dplyr binds
- bind_rows, bind_cols

Advantages of dplyr versions
- faster
- return a tibble (advanced data frame)
- can handle lists of dataframes
- .id argument in bind_rows allows you to pass in a name for each source dataframe
- this will be added as a column to the result dataset
- this is really handy when one of the variables for the data set is stored in the table name.
- e.g. houly weather data for different days
- rbind will return an error if the column names do not match exactly.
- bind_rows will create a column for each unique column and distribute NAs appropriately

Which bind?

# Examine side_one and side_two
side_one

## # A tibble: 5 x 2
##                       song     length
##                      <chr>     <time>
## 1              Speak to Me  5400 secs
## 2                  Breathe  9780 secs
## 3               On the Run 12600 secs
## 4                     Time 24780 secs
## 5 The Great Gig in the Sky 15300 secs

side_two

## # A tibble: 5 x 2
##                  song     length
##                 <chr>     <time>
## 1               Money 23400 secs
## 2         Us and Them 28260 secs
## 3 Any Colour You Like 12240 secs
## 4        Brain Damage 13800 secs
## 5             Eclipse  7380 secs

# Bind side_one and side_two into a single dataset
side_one %>% 
  bind_rows(side_two)

## # A tibble: 10 x 2
##                        song     length
##                       <chr>     <time>
##  1              Speak to Me  5400 secs
##  2                  Breathe  9780 secs
##  3               On the Run 12600 secs
##  4                     Time 24780 secs
##  5 The Great Gig in the Sky 15300 secs
##  6                    Money 23400 secs
##  7              Us and Them 28260 secs
##  8      Any Colour You Like 12240 secs
##  9             Brain Damage 13800 secs
## 10                  Eclipse  7380 secs

Bind rows

you can pass bind_rows() a list of data frames like jimi to bind together into a single data frame
you can save the name of the list element into a column with the .id argument

# Examine discography and jimi
discography

## # A tibble: 3 x 2
##                 album  year
##                 <chr> <int>
## 1 Are You Experienced  1967
## 2  Axis: Bold as Love  1967
## 3   Electric Ladyland  1968

jimi

## $`Are You Experienced`
## # A tibble: 10 x 2
##                        song     length
##                       <chr>     <time>
##  1              Purple Haze  9960 secs
##  2         Manic Depression 13560 secs
##  3                  Hey Joe 12180 secs
##  4         May This Be Love 11640 secs
##  5       I Don't Live Today 14100 secs
##  6      The Wind Cries Mary 12060 secs
##  7                     Fire  9240 secs
##  8 Third Stone from the Sun 24000 secs
##  9                Foxy Lady 11700 secs
## 10     Are You Experienced? 14100 secs
## 
## $`Axis: Bold As Love`
## # A tibble: 13 x 2
##                    song     length
##                   <chr>     <time>
##  1                  EXP  6900 secs
##  2    Up from the Skies 10500 secs
##  3 Spanish Castle Magic 10800 secs
##  4  Wait Until Tomorrow 10800 secs
##  5     Ain't No Telling  6360 secs
##  6          Little Wing  8640 secs
##  7           If 6 was 9 19920 secs
##  8   You Got Me Floatin  9900 secs
##  9 Castles Made of Sand  9960 secs
## 10        She's So Fine  9420 secs
## 11       One Rainy Wish 13200 secs
## 12    Little Miss Lover  8400 secs
## 13         Bold as Love 15060 secs
## 
## $`Electric Ladyland`
## # A tibble: 16 x 2
##                                          song     length
##                                         <chr>     <time>
##  1                     And the Gods Made Love  4860 secs
##  2  Have You Ever Been (To Electric Ladyland)  7860 secs
##  3                          Crosstown Traffic  8700 secs
##  4                               Voodoo Chile 54000 secs
##  5                        Little Miss Strange 10320 secs
##  6                      Long Hot Summer Night 12420 secs
##  7                           Come On (Part 1) 14940 secs
##  8                                 Gypsy Eyes 13380 secs
##  9               Burning of the Midnight Lamp 13140 secs
## 10                      Rainy Day, Dream Away 13320 secs
## 11     1983... (A Merman I Should Turn to Be) 49140 secs
## 12 Moon, Turn the Tides... Gently Gently Away  3720 secs
## 13              Still Raining, Still Dreaming 15900 secs
## 14                         House Burning Down 16380 secs
## 15                   All Along the Watchtower 14460 secs
## 16               Voodoo Child (Slight Return) 18720 secs

jimi %>% 
  # Bind jimi into a single data frame
  bind_rows(.id = "album") %>% 
  # Make a complete data frame
  left_join(discography, by = c("album"))

## # A tibble: 39 x 4
##                  album                     song     length  year
##                  <chr>                    <chr>     <time> <int>
##  1 Are You Experienced              Purple Haze  9960 secs  1967
##  2 Are You Experienced         Manic Depression 13560 secs  1967
##  3 Are You Experienced                  Hey Joe 12180 secs  1967
##  4 Are You Experienced         May This Be Love 11640 secs  1967
##  5 Are You Experienced       I Don't Live Today 14100 secs  1967
##  6 Are You Experienced      The Wind Cries Mary 12060 secs  1967
##  7 Are You Experienced                     Fire  9240 secs  1967
##  8 Are You Experienced Third Stone from the Sun 24000 secs  1967
##  9 Are You Experienced                Foxy Lady 11700 secs  1967
## 10 Are You Experienced     Are You Experienced? 14100 secs  1967
## # ... with 29 more rows

Bind columns

# Examine hank_years and hank_charts
hank_years

## # A tibble: 67 x 2
##     year                                    song
##    <int>                                   <chr>
##  1  1947                         Move It On Over
##  2  1947    My Love for You (Has Turned to Hate)
##  3  1947 Never Again (Will I Knock on Your Door)
##  4  1947    On the Banks of the Old Ponchartrain
##  5  1947                            Pan American
##  6  1947             Wealth Won't Save Your Soul
##  7  1948                   A Mansion on the Hill
##  8  1948                           Honky Tonkin'
##  9  1948                         I Saw the Light
## 10  1948                   I'm a Long Gone Daddy
## # ... with 57 more rows

hank_charts

## # A tibble: 67 x 2
##                               song  peak
##                              <chr> <int>
##  1 (I Heard That) Lonesome Whistle     9
##  2    (I'm Gonna) Sing, Sing, Sing    NA
##  3                A Home in Heaven    NA
##  4           A Mansion on the Hill    12
##  5            A Teardrop on a Rose    NA
##  6       At the First Fall of Snow    NA
##  7      Baby, We're Really in Love     4
##  8               California Zephyr    NA
##  9                     Calling You    NA
## 10                Cold, Cold Heart     1
## # ... with 57 more rows

hank_years %>% 
  # Reorder hank_years alphabetically by song title
  arrange(song) %>% 
  # Select just the year column
  select(year) %>% 
  # Bind the year column
  bind_cols(hank_charts) %>% 
  # Arrange the finished dataset
  arrange(year, song)

## # A tibble: 67 x 3
##     year                                    song  peak
##    <int>                                   <chr> <int>
##  1  1947                         Move It On Over     4
##  2  1947    My Love for You (Has Turned to Hate)    NA
##  3  1947 Never Again (Will I Knock on Your Door)    NA
##  4  1947    On the Banks of the Old Ponchartrain    NA
##  5  1947                            Pan American    NA
##  6  1947             Wealth Won't Save Your Soul    NA
##  7  1948                   A Mansion on the Hill    12
##  8  1948                           Honky Tonkin'    14
##  9  1948                         I Saw the Light    NA
## 10  1948                   I'm a Long Gone Daddy     6
## # ... with 57 more rows

Danger

The biggest risk when using bind_cols is that the observations may not align acros the rows of your data frame

Build a better data frame

R Base way to create a dataframe
- data.frame() and as.data.frame()
dplyr way to create a dataframe
- data_frame() and as_data_frame()
Advantages of using data_frame():
- will never change the data type of a vector. (e.g. strings to factors)
- will never add row names
- will not changes unusual column names
- only recycles length 1 inputs
- evaluate lazily and in order. (So you can reference a column in the next column definition)
- outputs a tibble (class tbl_df)
as_data_frame does all these same things
- its just used on a list of columns

Make a data frame

head(hank_year)

## [1] 1947 1947 1947 1947 1947 1947

head(hank_song)

## [1] "Move It On Over"                        
## [2] "My Love for You (Has Turned to Hate)"   
## [3] "Never Again (Will I Knock on Your Door)"
## [4] "On the Banks of the Old Ponchartrain"   
## [5] "Pan American"                           
## [6] "Wealth Won't Save Your Soul"

head(hank_peak)

## [1]  4 NA NA NA NA NA

# Make combined data frame using data_frame()
data_frame(
  year = hank_year,
  song = hank_song,
  peak = hank_peak
  ) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1)

## # A tibble: 11 x 3
##     year                                   song  peak
##    <int>                                  <chr> <int>
##  1  1949                         Lovesick Blues     1
##  2  1950               Long Gone Lonesome Blues     1
##  3  1950                      Moanin' the Blues     1
##  4  1950                  Why Don't You Love Me     1
##  5  1951                       Cold, Cold Heart     1
##  6  1951                       Hey Good Lookin'     1
##  7  1952 I'll Never Get Out of This World Alive     1
##  8  1952               Jambalaya (On the Bayou)     1
##  9  1953                               Kaw-Liga     1
## 10  1953        Take These Chains from My Heart     1
## 11  1953                    Your Cheatin' Heart     1

Lists of columns

# Examine the contents of hank
str(hank)

## List of 3
##  $ year: int [1:67] 1947 1947 1947 1947 1947 1947 1948 1948 1948 1948 ...
##  $ song: chr [1:67] "Move It On Over" "My Love for You (Has Turned to Hate)" "Never Again (Will I Knock on Your Door)" "On the Banks of the Old Ponchartrain" ...
##  $ peak: int [1:67] 4 NA NA NA NA NA 12 14 NA 2 ...

# Convert the hank list into a data frame
as_data_frame(hank) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1)

## # A tibble: 11 x 3
##     year                                   song  peak
##    <int>                                  <chr> <int>
##  1  1949                         Lovesick Blues     1
##  2  1950               Long Gone Lonesome Blues     1
##  3  1950                      Moanin' the Blues     1
##  4  1950                  Why Don't You Love Me     1
##  5  1951                       Cold, Cold Heart     1
##  6  1951                       Hey Good Lookin'     1
##  7  1952 I'll Never Get Out of This World Alive     1
##  8  1952               Jambalaya (On the Bayou)     1
##  9  1953                               Kaw-Liga     1
## 10  1953        Take These Chains from My Heart     1
## 11  1953                    Your Cheatin' Heart     1

List of rows (data frames)

# Examine the contents of michael
str(michael, give.attr = F)

## List of 10
##  $ Got to Be There :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Ain'T No Sunshine" "I Wanna be Where You Are" "Girl Don't Take Your Love from Me" "In Our Small Way" ...
##   ..$ peak: int [1:10] NA NA NA NA 4 2 NA NA NA NA
##  $ Ben             :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Ben" "Greatest Show On Earth" "People Make the World Go 'Round" "We've Got a Good Thing" ...
##   ..$ peak: int [1:10] 1 NA NA NA NA NA NA NA NA NA
##  $ Music & Me      :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "With a Child's Heart" "Up Again" "All the Things You Are" "Happy" ...
##   ..$ peak: int [1:10] 50 NA NA NA NA NA NA NA NA NA
##  $ Forever, Michael:Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "We're Almost There" "Take Me Back" "One Day in Your Life" "Cinderella Stay Awhile" ...
##   ..$ peak: int [1:10] 54 NA NA NA NA 23 NA NA NA NA
##  $ Off the Wall    :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Don't Stop 'Til You Get Enough" "Rock with You" "Working Day and Night" "Get on the Floor" ...
##   ..$ peak: int [1:10] 1 1 NA NA 10 NA 10 NA NA NA
##  $ Thriller        :Classes 'tbl_df', 'tbl' and 'data.frame':    9 obs. of  2 variables:
##   ..$ song: chr [1:9] "Wanna Be Startin' Somethin" "Baby be Mine" "The Girl is Mine" "Thriller" ...
##   ..$ peak: int [1:9] 5 NA 2 4 1 1 7 10 NA
##  $ Bad             :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Bad" "The Way You Make Me Feel" "Speed Demon" "Liberian Girl" ...
##   ..$ peak: int [1:10] 1 1 NA NA NA 11 1 1 1 7
##  $ Dangerous       :Classes 'tbl_df', 'tbl' and 'data.frame':    14 obs. of  2 variables:
##   ..$ song: chr [1:14] "Jam" "Why You Wanna Trip on Me" "In the Closet" "She Drives Me Wild" ...
##   ..$ peak: int [1:14] 26 NA 6 NA 3 NA 27 1 14 NA ...
##  $ HIStory         :Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  2 variables:
##   ..$ song: chr [1:30] "Billie Jean" "The Way You Make Me Feel" "Black or White" "Rock with You" ...
##   ..$ peak: int [1:30] NA NA NA NA NA NA NA NA NA NA ...
##  $ Invincible      :Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  2 variables:
##   ..$ song: chr [1:16] "Unbreakable" "Heartbreaker" "Invincible" "Break of Dawn" ...
##   ..$ peak: int [1:16] NA NA NA NA NA 10 NA NA NA NA ...

bind_rows(michael, .id = "album") %>% 
  group_by(album) %>% 
  mutate(rank = min_rank(peak)) %>% 
  filter(rank == 1) %>% 
  select(-rank, -peak)

## # A tibble: 16 x 2
## # Groups:   album [10]
##               album                           song
##               <chr>                          <chr>
##  1  Got to Be There                  Rockin' Robin
##  2              Ben                            Ben
##  3       Music & Me           With a Child's Heart
##  4 Forever, Michael       Just a Little Bit of You
##  5     Off the Wall Don't Stop 'Til You Get Enough
##  6     Off the Wall                  Rock with You
##  7         Thriller                        Beat It
##  8         Thriller                    Billie Jean
##  9              Bad                            Bad
## 10              Bad       The Way You Make Me Feel
## 11              Bad              Man in the Mirror
## 12              Bad   I Just Can't Stop Loving You
## 13              Bad                    Dirty Diana
## 14        Dangerous                 Black or White
## 15          HIStory              You Are Not Alone
## 16       Invincible              You Rock My World

Working with data types

Be aware of data types when combining vectors or data frames
- R will need to make a decision about the resulting type if the original types differ for a vector/column
6 atomic data types in R
- logical
- character
- double
- integer
- complex
- raw

typeof(TRUE)

## [1] "logical"

typeof("hello")

## [1] "character"

typeof(3.14)

## [1] "double"

typeof(1L)

## [1] "integer"

typeof(1 + 2i)

## [1] "complex"

typeof(raw(1))

## [1] "raw"

Atomic data types

R will coerce logicals and numerics to strings if strings are present
R will coerce logicals to numerics if numerics are present
as.character() will coerce factors to their labels as strings if strings are present
as.numeric() will coerce factors to their levels as numerics if numerics are present

dplyr’s coercion rules

Base R coercion rules:

in general the more specific type of data will convert to the more general
- integer and logical will go to integer
- double and integer/locgical will go to double
- character and integer/double/logical will go to character
factors are tricky
- if joining to an integer or double the factor integer values will be used
- if joining to a character the factor labels will be used
- be careful if you have numeric labels for your factors

dplyr’s coercion rules:

dplyr functions will not automatically coerce data types
- returns and error
- expects you to manyally coerce data
Exception: factors
- dplyr coverts non-aligning factors to strings
- gives warning message

dplyr and coercion

dplyr’s two table functions will
- coerce two factors to a string if the factors have different levels
- coerce two factors to a string if the factors have the same levels in a different order
- return an error instead of coercing logicals, integers, and numerics to a character

Determining type

eighties has the year variable stores as a factor
sevnties has it stored as a character
when combined it will be a character
- we will get a warning message

seventies

## # A tibble: 10 x 3
##      year                      album                 band
##    <fctr>                      <chr>                <chr>
##  1   1970 Bridge Over Troubled Water  Simon and Garfunkel
##  2   1971     Jesus Christ Superstar      Various Artists
##  3   1972                    Harvest           Neil Young
##  4   1973      The World is a Ghetto                  War
##  5   1974  Goodbye Yellow Brick Road           Elton John
##  6   1975 Elton John's Greatest Hits           Elton John
##  7   1976             Peter Frampton Frampton Comes Alive
##  8   1977                    Rumours        Fleetwood Mac
##  9   1978       Saturday Night Fever             Bee Gees
## 10   1979                 Billy Joel          52nd Street

eighties

## # A tibble: 10 x 3
##     year              album              band
##    <chr>              <chr>             <chr>
##  1  1980           The Wall        Pink Floyd
##  2  1981      Hi Infidelity    REO Speedwagon
##  3  1982               Asia              Asia
##  4  1983           Thriller   Michael Jackson
##  5  1984           Thriller   Michael Jackson
##  6  1985 Born in the U.S.A. Bruce Springsteen
##  7  1986    Whitney Houston   Whitney Houston
##  8  1987  Slippery When Wet          Bon Jovi
##  9  1988              Faith    George Michael
## 10  1989     Don't Be Cruel       Bobby Brown

both <- seventies %>% bind_rows(eighties)

## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

head(both)

## # A tibble: 6 x 3
##    year                      album                band
##   <chr>                      <chr>               <chr>
## 1  1970 Bridge Over Troubled Water Simon and Garfunkel
## 2  1971     Jesus Christ Superstar     Various Artists
## 3  1972                    Harvest          Neil Young
## 4  1973      The World is a Ghetto                 War
## 5  1974  Goodbye Yellow Brick Road          Elton John
## 6  1975 Elton John's Greatest Hits          Elton John

Results

# sixties$year is a numeric and seventies$year is a factor
str(sixties)

## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : num  1960 1961 1962 1963 1964 ...
##  $ album: chr  "The Sound of Music" "Camelot" "West Side Story" "West Side Story" ...
##  $ band : chr  "Original Broadway Cast" "Original Broadway Cast" "Soundtrack" "Soundtrack" ...

str(seventies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : Factor w/ 10 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10
##  $ album: chr  "Bridge Over Troubled Water" "Jesus Christ Superstar" "Harvest" "The World is a Ghetto" ...
##  $ band : chr  "Simon and Garfunkel" "Various Artists" "Neil Young" "War" ...

# This gives an error about not converting factor to numeric
seventies %>% bind_rows(sixties)

## Error in bind_rows_(x, .id): Column `year` can't be converted from factor to numeric

# So lets convert it....
# But this converts the year to the integer factor value, not the year. Whoops
seventies %>% 
  mutate(year = as.numeric(year))

## # A tibble: 10 x 3
##     year                      album                 band
##    <dbl>                      <chr>                <chr>
##  1     1 Bridge Over Troubled Water  Simon and Garfunkel
##  2     2     Jesus Christ Superstar      Various Artists
##  3     3                    Harvest           Neil Young
##  4     4      The World is a Ghetto                  War
##  5     5  Goodbye Yellow Brick Road           Elton John
##  6     6 Elton John's Greatest Hits           Elton John
##  7     7             Peter Frampton Frampton Comes Alive
##  8     8                    Rumours        Fleetwood Mac
##  9     9       Saturday Night Fever             Bee Gees
## 10    10                 Billy Joel          52nd Street

# We need to make it a character first to get the year, then convert to numeric
seventies %>% mutate(year = as.numeric(as.character(year)))

## # A tibble: 10 x 3
##     year                      album                 band
##    <dbl>                      <chr>                <chr>
##  1  1970 Bridge Over Troubled Water  Simon and Garfunkel
##  2  1971     Jesus Christ Superstar      Various Artists
##  3  1972                    Harvest           Neil Young
##  4  1973      The World is a Ghetto                  War
##  5  1974  Goodbye Yellow Brick Road           Elton John
##  6  1975 Elton John's Greatest Hits           Elton John
##  7  1976             Peter Frampton Frampton Comes Alive
##  8  1977                    Rumours        Fleetwood Mac
##  9  1978       Saturday Night Fever             Bee Gees
## 10  1979                 Billy Joel          52nd Street

# Now we can join the datasets correctly
seventies %>% 
  # Coerce seventies$year into a useful numeric
  mutate(year = as.numeric(as.character(year))) %>% 
  # Bind the updated version of seventies to sixties
  bind_rows(sixties) %>% 
  arrange(year)

## # A tibble: 20 x 3
##     year                          album                            band
##    <dbl>                          <chr>                           <chr>
##  1  1960             The Sound of Music          Original Broadway Cast
##  2  1961                        Camelot          Original Broadway Cast
##  3  1962                West Side Story                      Soundtrack
##  4  1963                West Side Story                      Soundtrack
##  5  1964                  Hello, Dolly!          Original Broadway Cast
##  6  1965                   Mary Poppins                      Soundtrack
##  7  1966 Whipped Cream & Other Delights Herb Alpert & The Tijuana Brass
##  8  1967            More of The Monkees                     The Monkees
##  9  1968           Are You Experienced?     The Jimi Hendrix Experience
## 10  1969             In-A-Gadda-Da-Vida                  Iron Butterfly
## 11  1970     Bridge Over Troubled Water             Simon and Garfunkel
## 12  1971         Jesus Christ Superstar                 Various Artists
## 13  1972                        Harvest                      Neil Young
## 14  1973          The World is a Ghetto                             War
## 15  1974      Goodbye Yellow Brick Road                      Elton John
## 16  1975     Elton John's Greatest Hits                      Elton John
## 17  1976                 Peter Frampton            Frampton Comes Alive
## 18  1977                        Rumours                   Fleetwood Mac
## 19  1978           Saturday Night Fever                        Bee Gees
## 20  1979                     Billy Joel                     52nd Street

Advanced Joining

What can go wrong?

This chapter is about becoming a master at joining data
The main problems with joins come from missing or duplicate key values or columns
NA in the key column?
- Probably just need to remove these rows
Missing key column?
- it might be in the row name. use rownames_to_column in tibble package
duplicate keys?
- Might just need to include another column in the key criteria
- If you really have duplicates be careful as you will get every combination of key value matches across datasets.

Spot the key

# Whoops, the primary key is in the row names for stage_songs
# We need to fix that before joining to stage_writers
stage_songs

##                                     musical year
## Children Will Listen         Into the Woods 1986
## Maria                       West Side Story 1957
## Memory                                 Cats 1981
## The Music of the Night Phantom of the Opera 1986

stage_writers

##                     song            composer
## 1   Children Will Listen    Stephen Sondheim
## 2                  Maria   Leonard Bernstein
## 3                 Memory Andrew Lloyd Webber
## 4 The Music of the Night Andrew Lloyd Webber

# Load the tibble package
library(tibble)

stage_songs %>% 
  # Add row names as a column named song
  rownames_to_column("song") %>% 
  # Left join stage_writers to stage_songs
  left_join(stage_writers, by = "song")

##                     song              musical year            composer
## 1   Children Will Listen       Into the Woods 1986    Stephen Sondheim
## 2                  Maria      West Side Story 1957   Leonard Bernstein
## 3                 Memory                 Cats 1981 Andrew Lloyd Webber
## 4 The Music of the Night Phantom of the Opera 1986 Andrew Lloyd Webber

Non-unique keys

Non unique keys can be okay if you want the info for each entry and don’t mind it being duplicated.
- like years will get duplicated here because there are non unique keys in the musical column of composers dataset

shows

## # A tibble: 6 x 2
##                musical  year
##                  <chr> <int>
## 1   The Sound of Music  1959
## 2       Into the Woods  1986
## 3       The King and I  1951
## 4      West Side Story  1957
## 5                 Cats  1981
## 6 Phantom of the Opera  1986

# There are more than one composer for some musicals, so the musical column has duplicate key values. 
composers

## # A tibble: 8 x 2
##                musical             composer
##                  <chr>                <chr>
## 1                 Cats  Andrew Lloyd Webber
## 2       Into the Woods     Stephen Sondheim
## 3 Phantom of the Opera  Andrew Lloyd Webber
## 4       The King and I       Richard Rogers
## 5       The King and I Oscar Hammerstein II
## 6   The Sound of Music       Richard Rogers
## 7   The Sound of Music Oscar Hammerstein II
## 8      West Side Story    Leonard Bernstein

# join by musical
shows %>% left_join(composers, by = "musical")

## # A tibble: 8 x 3
##                musical  year             composer
##                  <chr> <int>                <chr>
## 1   The Sound of Music  1959       Richard Rogers
## 2   The Sound of Music  1959 Oscar Hammerstein II
## 3       Into the Woods  1986     Stephen Sondheim
## 4       The King and I  1951       Richard Rogers
## 5       The King and I  1951 Oscar Hammerstein II
## 6      West Side Story  1957    Leonard Bernstein
## 7                 Cats  1981  Andrew Lloyd Webber
## 8 Phantom of the Opera  1986  Andrew Lloyd Webber

Two non-unique keys

But in this case its kind of a problem becuse the key is duplicated in both show_songs and composers datasets
- There are multiple songs in a musical and multiple composers for a musical
- We end up with two rows for every song if there were two composers for that musical

show_songs

## # A tibble: 8 x 3
##                          song              musical  year
##                         <chr>                <chr> <int>
## 1 A Few of My Favorite Things   The Sound of Music  1959
## 2        Children Will Listen       Into the Woods  1986
## 3                   Edelweiss   The Sound of Music  1959
## 4         Getting to Know You       The King and I  1951
## 5                       Maria      West Side Story  1957
## 6                      Memory                 Cats  1981
## 7             Shall We Dance?       The King and I  1951
## 8      The Music of the Night Phantom of the Opera  1986

show_songs %>% 
  left_join(composers, by = "musical") %>%
  arrange(musical) %>% 
  head(12)

## # A tibble: 12 x 4
##                           song              musical  year
##                          <chr>                <chr> <int>
##  1                      Memory                 Cats  1981
##  2        Children Will Listen       Into the Woods  1986
##  3      The Music of the Night Phantom of the Opera  1986
##  4         Getting to Know You       The King and I  1951
##  5         Getting to Know You       The King and I  1951
##  6             Shall We Dance?       The King and I  1951
##  7             Shall We Dance?       The King and I  1951
##  8 A Few of My Favorite Things   The Sound of Music  1959
##  9 A Few of My Favorite Things   The Sound of Music  1959
## 10                   Edelweiss   The Sound of Music  1959
## 11                   Edelweiss   The Sound of Music  1959
## 12                       Maria      West Side Story  1957
## # ... with 1 more variables: composer <chr>

Missing keys

Missing keys can mistakenly join two rows that do not belong toegether

singers

## # A tibble: 2 x 2
##                movie                singer
##                <chr>                 <chr>
## 1               <NA> Arnold Schwarzenegger
## 2 The Sound of Music         Julie Andrews

two_songs

## # A tibble: 2 x 2
##                  song              movie
##                 <chr>              <chr>
## 1            Do-Re-Mi The Sound of Music
## 2 A Spoonful of Sugar               <NA>

# Examine the result of joining singers to two_songs
two_songs %>% inner_join(singers, by = "movie")

## # A tibble: 2 x 3
##                  song              movie                singer
##                 <chr>              <chr>                 <chr>
## 1            Do-Re-Mi The Sound of Music         Julie Andrews
## 2 A Spoonful of Sugar               <NA> Arnold Schwarzenegger

# Remove NA's from key before joining
two_songs %>% 
  filter(!is.na(movie)) %>% 
  inner_join(singers, by = "movie")

## # A tibble: 1 x 3
##       song              movie        singer
##      <chr>              <chr>         <chr>
## 1 Do-Re-Mi The Sound of Music Julie Andrews

Defining the keys

If you do not set a by argument dplyr will join on all columns that match across the two datasets
You can use named vetors to match column with different names by = c("name1"" = "name2")
If there are non-key columns with the same names they wil show up as name.x and name.y
- You can change this by adding the suffix argument. suffix = c("name1","name2")

A subset of keys

The movie column is the key and name refers to different things in these two datasets
- We will clean up the name columns after the join

movie_years

## # A tibble: 10 x 3
##                      movie          name  year
##                      <chr>         <chr> <int>
##  1     The Road to Morocco   Bing Crosby  1942
##  2            Going My Way   Bing Crosby  1944
##  3          Anchors Aweigh Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5         White Christmas   Bing Crosby  1954
##  6         The Tender Trap Frank Sinatra  1955
##  7            High Society   Bing Crosby  1956
##  8       The Joker is Wild Frank Sinatra  1957
##  9                Pal Joey Frank Sinatra  1957
## 10                 Can-Can Frank Sinatra  1960

movie_studios

## # A tibble: 10 x 2
##                      movie                  name
##                      <chr>                 <chr>
##  1     The Road to Morocco    Paramount Pictures
##  2            Going My Way    Paramount Pictures
##  3          Anchors Aweigh   Metro-Goldwyn-Mayer
##  4 Till the Clouds Roll By   Metro-Goldwyn-Mayer
##  5         White Christmas    Paramount Pictures
##  6         The Tender Trap   Metro-Goldwyn-Mayer
##  7            High Society   Metro-Goldwyn-Mayer
##  8       The Joker is Wild    Paramount Pictures
##  9                Pal Joey     Columbia Pictures
## 10                 Can-Can Twentieth-Century Fox

movie_years %>% 
  # Left join movie_studios to movie_years
  left_join(movie_studios, by = "movie") %>% 
  # Rename the columns: artist and studio
  rename(artist = name.x, studio = name.y)

## # A tibble: 10 x 4
##                      movie        artist  year                studio
##                      <chr>         <chr> <int>                 <chr>
##  1     The Road to Morocco   Bing Crosby  1942    Paramount Pictures
##  2            Going My Way   Bing Crosby  1944    Paramount Pictures
##  3          Anchors Aweigh Frank Sinatra  1945   Metro-Goldwyn-Mayer
##  4 Till the Clouds Roll By Frank Sinatra  1946   Metro-Goldwyn-Mayer
##  5         White Christmas   Bing Crosby  1954    Paramount Pictures
##  6         The Tender Trap Frank Sinatra  1955   Metro-Goldwyn-Mayer
##  7            High Society   Bing Crosby  1956   Metro-Goldwyn-Mayer
##  8       The Joker is Wild Frank Sinatra  1957    Paramount Pictures
##  9                Pal Joey Frank Sinatra  1957     Columbia Pictures
## 10                 Can-Can Frank Sinatra  1960 Twentieth-Century Fox

Mis-matched key names

Here we use the named argument to create the correct join since the keys have different column names
- This is a very common case in most relational databases. You will have something_dimension_id in the fact table and just id as the key in the dimension table

# Identify the key column
elvis_songs

## # A tibble: 5 x 2
##                                   name          movie
##                                  <chr>          <chr>
## 1 (You're So Square) Baby I Don't Care Jailhouse Rock
## 2         I Can't Help Falling in Love    Blue Hawaii
## 3                       Jailhouse Rock Jailhouse Rock
## 4                       Viva Las Vegas Viva Las Vegas
## 5                    You Don't Know Me       Clambake

elvis_movies

## # A tibble: 4 x 2
##             name  year
##            <chr> <int>
## 1 Jailhouse Rock  1957
## 2    Blue Hawaii  1961
## 3 Viva Las Vegas  1963
## 4       Clambake  1967

elvis_movies %>% 
  # Left join elvis_songs to elvis_movies by this column
  left_join(elvis_songs, by = c("name" = "movie")) %>% 
  # Rename columns
  rename(movie = name, song = name.y)

## # A tibble: 5 x 3
##            movie  year                                 song
##            <chr> <int>                                <chr>
## 1 Jailhouse Rock  1957 (You're So Square) Baby I Don't Care
## 2 Jailhouse Rock  1957                       Jailhouse Rock
## 3    Blue Hawaii  1961         I Can't Help Falling in Love
## 4 Viva Las Vegas  1963                       Viva Las Vegas
## 5       Clambake  1967                    You Don't Know Me

More mismatched names

We can rename columns in the select statment

# Identify the key columns
movie_directors

## # A tibble: 10 x 3
##                       name        director                studio
##                      <chr>           <chr>                 <chr>
##  1          Anchors Aweigh   George Sidney   Metro-Goldwyn-Mayer
##  2                 Can-Can     Walter Lang Twentieth-Century Fox
##  3            Going My Way     Leo McCarey    Paramount Pictures
##  4            High Society Charles Walters   Metro-Goldwyn-Mayer
##  5                Pal Joey   George Sidney     Columbia Pictures
##  6       The Joker is Wild   Charles Vidor    Paramount Pictures
##  7     The Road to Morocco    David Butler    Paramount Pictures
##  8         The Tender Trap Charles Walters   Metro-Goldwyn-Mayer
##  9 Till the Clouds Roll By   Richard Whorf   Metro-Goldwyn-Mayer
## 10         White Christmas  Michael Curtiz    Paramount Pictures

movie_years

## # A tibble: 10 x 3
##                      movie          name  year
##                      <chr>         <chr> <int>
##  1     The Road to Morocco   Bing Crosby  1942
##  2            Going My Way   Bing Crosby  1944
##  3          Anchors Aweigh Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5         White Christmas   Bing Crosby  1954
##  6         The Tender Trap Frank Sinatra  1955
##  7            High Society   Bing Crosby  1956
##  8       The Joker is Wild Frank Sinatra  1957
##  9                Pal Joey Frank Sinatra  1957
## 10                 Can-Can Frank Sinatra  1960

movie_years %>% 
  # Left join movie_directors to movie_years
  left_join(movie_directors, by = c("movie" = "name")) %>% 
  # Arrange the columns using select()
  select(year, movie, artist = name, director, studio)

## # A tibble: 10 x 5
##     year                   movie        artist        director
##    <int>                   <chr>         <chr>           <chr>
##  1  1942     The Road to Morocco   Bing Crosby    David Butler
##  2  1944            Going My Way   Bing Crosby     Leo McCarey
##  3  1945          Anchors Aweigh Frank Sinatra   George Sidney
##  4  1946 Till the Clouds Roll By Frank Sinatra   Richard Whorf
##  5  1954         White Christmas   Bing Crosby  Michael Curtiz
##  6  1955         The Tender Trap Frank Sinatra Charles Walters
##  7  1956            High Society   Bing Crosby Charles Walters
##  8  1957       The Joker is Wild Frank Sinatra   Charles Vidor
##  9  1957                Pal Joey Frank Sinatra   George Sidney
## 10  1960                 Can-Can Frank Sinatra     Walter Lang
## # ... with 1 more variables: studio <chr>

Joining multiple tables

To join multiple tables we could do a bunch of left_joins in a row
The purrr package has a function reduce() that lets us apply a function recursively to a list of datasets
purrr is also written by hadley wickham so it will work well with dplyr functions
the purpose of purrr is to help us apply R functions to data in efficient ways

Join multiple tables

Using reduce to do a left_quick join of all tables

# Load the purrr library
library(purrr)

# Place supergroups, more_bands, and more_artists into a list
list(supergroups, more_bands, more_artists) %>% 
  # Use reduce to join together the contents of the list
  reduce(left_join, by = c("first" = "first","last" = "last"))

## # A tibble: 19 x 5
##                         supergroup   first          last
##                              <chr>   <chr>         <chr>
##  1                           Cream  Ginger         Baker
##  2                           Cream    Jack         Bruce
##  3                           Cream    Jack         Bruce
##  4                           Cream    Jack         Bruce
##  5                           Cream    Jack         Bruce
##  6                           Cream    Eric       Clapton
##  7 Crosby, Stills, Nash, and Young   David        Crosby
##  8 Crosby, Stills, Nash, and Young  Graham          Nash
##  9 Crosby, Stills, Nash, and Young Stephen        Stills
## 10 Crosby, Stills, Nash, and Young    Neil         Young
## 11                  The Highwaymen  Johnny          Cash
## 12                  The Highwaymen  Waylon      Jennings
## 13                  The Highwaymen    Kris Kristofferson
## 14                  The Highwaymen  Willie        Nelson
## 15          The Traveling Wilburys     Bob         Dylan
## 16          The Traveling Wilburys  George      Harrison
## 17          The Traveling Wilburys    Jeff         Lynne
## 18          The Traveling Wilburys     Roy       Orbison
## 19          The Traveling Wilburys     Tom         Petty
## # ... with 2 more variables: band <chr>, instrument <chr>

Filter multiple tables

Here we use reduce to do a filtering semi_join

list(more_artists, more_bands, supergroups) %>% 
  # Return rows of more_artists in all three datasets
  reduce(semi_join, by = c("first","last"))

## # A tibble: 11 x 3
##      first     last instrument
##      <chr>    <chr>      <chr>
##  1  Ginger    Baker      drums
##  2    Jack    Bruce       bass
##  3    Jack    Bruce       bass
##  4    Eric  Clapton     guitar
##  5   David   Crosby     guitar
##  6  Graham     Nash     guitar
##  7 Stephen   Stills     guitar
##  8    Neil    Young     guitar
##  9  George Harrison     guitar
## 10    Jeff    Lynne     vocals
## 11     Tom    Petty     vocals

Other implementations

We can do the same joins as we have been using here wih the merge function, but the syntax is more confusing
- I will always just use dplyr joins
The dplyr joins translate to sql joins as show
- These are good to know as often I am exploring a sql database before pulling data into R. It happens.
These are the DBMSs (Database management systems) you can easily connect to from R
- Once you have connected you can write dplyr code and it will be translated to sql and run in the DBMS
- This is incredible useful. Its worthy of its own corse.
- checkout vignette("dbplyr") to see more about dplyr’s database capabilities
- We can use collect to pull the data into R when we are ready or data.frame will also pull it in.

SQL

Each dplyr join has an SQL equivalent, and when you apply a dplyr join to a SQL data source (instead of a data frame), dplyr automatically converts your join to its SQL equivalent and runs it in the database.

Base R

# Alter the code to perform the join with a dplyr function
merge(bands, artists, by = c("first", "last"), all.x = TRUE) %>%
  arrange(band)

##        first      last               band instrument
## 1      Jimmy      Page       Led Zeppelin     Guitar
## 2       John    Bonham       Led Zeppelin       <NA>
## 3  John Paul     Jones       Led Zeppelin       <NA>
## 4     Robert     Plant       Led Zeppelin       <NA>
## 5     George  Harrison        The Beatles     Guitar
## 6       John    Lennon        The Beatles     Guitar
## 7       Paul McCartney        The Beatles       Bass
## 8      Ringo     Starr        The Beatles      Drums
## 9      Jimmy   Buffett  The Coral Reefers     Guitar
## 10   Charlie     Watts The Rolling Stones       <NA>
## 11     Keith  Richards The Rolling Stones     Guitar
## 12      Mick    Jagger The Rolling Stones     Vocals
## 13    Ronnie      Wood The Rolling Stones       <NA>

bands %>% left_join(artists, by = c("first", "last"))

## # A tibble: 13 x 4
##        first      last               band instrument
##        <chr>     <chr>              <chr>      <chr>
##  1      John    Bonham       Led Zeppelin       <NA>
##  2 John Paul     Jones       Led Zeppelin       <NA>
##  3     Jimmy      Page       Led Zeppelin     Guitar
##  4    Robert     Plant       Led Zeppelin       <NA>
##  5    George  Harrison        The Beatles     Guitar
##  6      John    Lennon        The Beatles     Guitar
##  7      Paul McCartney        The Beatles       Bass
##  8     Ringo     Starr        The Beatles      Drums
##  9     Jimmy   Buffett  The Coral Reefers     Guitar
## 10      Mick    Jagger The Rolling Stones     Vocals
## 11     Keith  Richards The Rolling Stones     Guitar
## 12   Charlie     Watts The Rolling Stones       <NA>
## 13    Ronnie      Wood The Rolling Stones       <NA>

Case Study

Lahman’s Baseball Database

library(Lahman)
Sean Lahman’s baseball database
26 tables
A great example of real worl relational data

Universal keys?

In particular, do any variable names span all of the datasets?

library(Lahman)

# Examine lahmanNames
str(lahmanNames)

## List of 28
##  $ AllstarFull        :Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of  1 variable:
##   ..$ var: chr [1:8] "playerID" "yearID" "gameNum" "gameID" ...
##  $ Appearances        :Classes 'tbl_df', 'tbl' and 'data.frame': 21 obs. of  1 variable:
##   ..$ var: chr [1:21] "yearID" "teamID" "lgID" "playerID" ...
##  $ AwardsManagers     :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "awardID" "yearID" "lgID" ...
##  $ AwardsPlayers      :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "awardID" "yearID" "lgID" ...
##  $ AwardsShareManagers:Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of  1 variable:
##   ..$ var: chr [1:7] "awardID" "yearID" "lgID" "playerID" ...
##  $ AwardsSharePlayers :Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of  1 variable:
##   ..$ var: chr [1:7] "awardID" "yearID" "lgID" "playerID" ...
##  $ Batting            :Classes 'tbl_df', 'tbl' and 'data.frame': 22 obs. of  1 variable:
##   ..$ var: chr [1:22] "playerID" "yearID" "stint" "teamID" ...
##  $ battingLabels      :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ BattingPost        :Classes 'tbl_df', 'tbl' and 'data.frame': 22 obs. of  1 variable:
##   ..$ var: chr [1:22] "yearID" "round" "playerID" "teamID" ...
##  $ CollegePlaying     :Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of  1 variable:
##   ..$ var: chr [1:3] "playerID" "schoolID" "yearID"
##  $ Fielding           :Classes 'tbl_df', 'tbl' and 'data.frame': 18 obs. of  1 variable:
##   ..$ var: chr [1:18] "playerID" "yearID" "stint" "teamID" ...
##  $ fieldingLabels     :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ FieldingOF         :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "yearID" "stint" "Glf" ...
##  $ FieldingPost       :Classes 'tbl_df', 'tbl' and 'data.frame': 17 obs. of  1 variable:
##   ..$ var: chr [1:17] "playerID" "yearID" "teamID" "lgID" ...
##  $ HallOfFame         :Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of  1 variable:
##   ..$ var: chr [1:9] "playerID" "yearID" "votedBy" "ballots" ...
##  $ LahmanData         :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "file" "class" "nobs" "nvar" ...
##  $ Managers           :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "playerID" "yearID" "teamID" "lgID" ...
##  $ ManagersHalf       :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "playerID" "yearID" "teamID" "lgID" ...
##  $ Master             :Classes 'tbl_df', 'tbl' and 'data.frame': 26 obs. of  1 variable:
##   ..$ var: chr [1:26] "playerID" "birthYear" "birthMonth" "birthDay" ...
##  $ Pitching           :Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  1 variable:
##   ..$ var: chr [1:30] "playerID" "yearID" "stint" "teamID" ...
##  $ pitchingLabels     :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ PitchingPost       :Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  1 variable:
##   ..$ var: chr [1:30] "playerID" "yearID" "round" "teamID" ...
##  $ Salaries           :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "yearID" "teamID" "lgID" "playerID" ...
##  $ Schools            :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "schoolID" "name_full" "city" "state" ...
##  $ SeriesPost         :Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of  1 variable:
##   ..$ var: chr [1:9] "yearID" "round" "teamIDwinner" "lgIDwinner" ...
##  $ Teams              :Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of  1 variable:
##   ..$ var: chr [1:48] "yearID" "lgID" "teamID" "franchID" ...
##  $ TeamsFranchises    :Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of  1 variable:
##   ..$ var: chr [1:4] "franchID" "franchName" "active" "NAassoc"
##  $ TeamsHalf          :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "yearID" "lgID" "teamID" "Half" ...

# Find variables in common
reduce(lahmanNames, intersect)

## # A tibble: 0 x 1
## # ... with 1 variables: var <chr>

Common keys?

lahmanNames %>%  
  # Bind the data frames in lahmanNames
  bind_rows(.id = 'dataframe') %>%
  # Group the result by var
  group_by(var) %>%
  # Tally the number of appearances
  tally() %>%
  # Filter the data
  filter(n > 1) %>% 
  # Arrange the results
  arrange(desc(n))

## # A tibble: 59 x 2
##         var     n
##       <chr> <int>
##  1   yearID    21
##  2 playerID    19
##  3     lgID    17
##  4   teamID    13
##  5        G    10
##  6        L     6
##  7        W     6
##  8       BB     5
##  9       CS     5
## 10       GS     5
## # ... with 49 more rows

playerID

lahmanNames %>% 
  # Bind the data frames
  bind_rows(.id = 'dataframe') %>%
  # Filter the results
  filter(var == 'playerID') %>% 
  # Extract the dataframe variable
  `$`(dataframe)

##  [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
##  [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
##  [7] "Batting"             "BattingPost"         "CollegePlaying"     
## [10] "Fielding"            "FieldingOF"          "FieldingPost"       
## [13] "HallOfFame"          "Managers"            "ManagersHalf"       
## [16] "Master"              "Pitching"            "PitchingPost"       
## [19] "Salaries"

Salaries

Who are the players?

Get count of distinct players

players <- Master %>% 
  # Return one row for each distinct player
  distinct(playerID, nameFirst, nameLast)
  
nrow(players)

## [1] 18846

nrow(Master)

## [1] 18846

Missing salaries

How many players are completely missing salary information?

head(players)

##    playerID nameFirst nameLast
## 1 aardsda01     David  Aardsma
## 2 aaronha01      Hank    Aaron
## 3 aaronto01    Tommie    Aaron
## 4  aasedo01       Don     Aase
## 5  abadan01      Andy     Abad
## 6  abadfe01  Fernando     Abad

head(Salaries)

##   yearID teamID lgID  playerID salary
## 1   1985    ATL   NL barkele01 870000
## 2   1985    ATL   NL bedrost01 550000
## 3   1985    ATL   NL benedbr01 545000
## 4   1985    ATL   NL  campri01 633333
## 5   1985    ATL   NL ceronri01 625000
## 6   1985    ATL   NL chambch01 800000

players %>% 
  # Find all players who do not appear in Salaries
  anti_join(Salaries, by = 'playerID') %>%
  # Count them
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1 13888

Unpaid games?

Is it possible that these players somehow did not play (and hence did not earn a salary)?

head(Appearances)

##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b
## 1   1871    BS1   NA barnero01    31 NA        31        31   0   0    0
## 2   1871    BS1   NA barrofr01    18 NA        18        18   0   0    0
## 3   1871    BS1   NA birdsda01    29 NA        29        29   0   7    0
## 4   1871    BS1   NA  conefr01    19 NA        19        19   0   0    0
## 5   1871    BS1   NA gouldch01    31 NA        31        31   0   0   30
## 6   1871    BS1   NA jackssa01    16 NA        16        16   0   0    0
##   G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1   16    0   15    0    0    0    0   NA   NA   NA
## 2    1    0    0   13    0    4   17   NA   NA   NA
## 3    0    0    0    0    0   27   27   NA   NA   NA
## 4    0    0    0   18    0    1   18   NA   NA   NA
## 5    0    0    0    0    0    1    1   NA   NA   NA
## 6   14    0    1    0    1    0    1   NA   NA   NA

players %>% 
  anti_join(Salaries, by = "playerID") %>% 
  # How many unsalaried players appear in Appearances?
  semi_join(Appearances, by = "playerID") %>% 
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1 13695

How many games?

Can you determine how many games each of these unsalaried players played?

players %>% 
  # Find all players who do not appear in Salaries
  anti_join(Salaries, by = "playerID") %>% 
  # Join them to Appearances
  left_join(Appearances, by = "playerID") %>% 
  # Calculate total_games for each player
  group_by(playerID) %>%
  summarise(total_games = sum(G_all, na.rm = T)) %>%
  # Arrange in descending order by total_games
  arrange(desc(total_games))

## # A tibble: 13,888 x 2
##     playerID total_games
##        <chr>       <int>
##  1 yastrca01        3308
##  2 aaronha01        3298
##  3  cobbty01        3034
##  4 musiast01        3026
##  5  mayswi01        2992
##  6 robinbr01        2896
##  7 kalinal01        2834
##  8 collied01        2826
##  9 robinfr02        2808
## 10 wagneho01        2794
## # ... with 13,878 more rows

How many at-bats?

Is it possible that the unsalaried players did not actually play in the games that they appeared in?

head(Batting)

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA

players %>%
  # Find unsalaried players
  anti_join(Salaries, by = "playerID") %>% 
  # Join Batting to the unsalaried players
  left_join(Batting, by = "playerID") %>% 
  # Group by player
  group_by(playerID) %>% 
  # Sum at-bats for each player
  summarize(total_at_bat = sum(AB, na.rm = T)) %>% 
  # Arrange in descending order
  arrange(desc(total_at_bat))

## # A tibble: 13,888 x 2
##     playerID total_at_bat
##        <chr>        <int>
##  1 aaronha01        12364
##  2 yastrca01        11988
##  3  cobbty01        11434
##  4 musiast01        10972
##  5  mayswi01        10881
##  6 robinbr01        10654
##  7 wagneho01        10430
##  8 brocklo01        10332
##  9 ansonca01        10277
## 10 aparilu01        10230
## # ... with 13,878 more rows

Introducing the hall of fame

Hall of fame nominations

Let’s see how many players have been nominated for the Hall of Fame.

head(HallOfFame)

##    playerID yearID votedBy ballots needed votes inducted category
## 1  cobbty01   1936   BBWAA     226    170   222        Y   Player
## 2  ruthba01   1936   BBWAA     226    170   215        Y   Player
## 3 wagneho01   1936   BBWAA     226    170   215        Y   Player
## 4 mathech01   1936   BBWAA     226    170   205        Y   Player
## 5 johnswa01   1936   BBWAA     226    170   189        Y   Player
## 6 lajoina01   1936   BBWAA     226    170   146        N   Player
##   needed_note
## 1        <NA>
## 2        <NA>
## 3        <NA>
## 4        <NA>
## 5        <NA>
## 6        <NA>

# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>% 
  distinct(playerID)

nominated %>% 
  # Count the number of players in nominated
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  1239

nominated_full <- nominated %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

head(nominated_full)

##    playerID nameFirst  nameLast
## 1  cobbty01        Ty      Cobb
## 2  ruthba01      Babe      Ruth
## 3 wagneho01     Honus    Wagner
## 4 mathech01   Christy Mathewson
## 5 johnswa01    Walter   Johnson
## 6 lajoina01       Nap    Lajoie

Hall of fame inductions

Let’s now see how many players were admitted to the hall of fame to examine how selective the voting process is.

# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>% 
  filter(inducted == "Y") %>%
  distinct(playerID)

# Note if you wanted to keep all of the columns you can use keep_all in the distinct function
HallOfFame %>%
  distinct(playerID, .keep_all = TRUE) %>%
  filter(inducted == "Y") %>% 
  head()

##    playerID yearID    votedBy ballots needed votes inducted
## 1  cobbty01   1936      BBWAA     226    170   222        Y
## 2  ruthba01   1936      BBWAA     226    170   215        Y
## 3 wagneho01   1936      BBWAA     226    170   215        Y
## 4 mathech01   1936      BBWAA     226    170   205        Y
## 5 johnswa01   1936      BBWAA     226    170   189        Y
## 6 bulkemo99   1937 Centennial      NA     NA    NA        Y
##            category needed_note
## 1            Player        <NA>
## 2            Player        <NA>
## 3            Player        <NA>
## 4            Player        <NA>
## 5            Player        <NA>
## 6 Pioneer/Executive        <NA>

inducted %>% 
  # Count the number of players in inducted
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1   312

inducted_full <- inducted %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

head(inducted_full)

##    playerID nameFirst  nameLast
## 1  cobbty01        Ty      Cobb
## 2  ruthba01      Babe      Ruth
## 3 wagneho01     Honus    Wagner
## 4 mathech01   Christy Mathewson
## 5 johnswa01    Walter   Johnson
## 6 lajoina01       Nap    Lajoie

Awards

Did nominees who were inducted earn more awards than nominees who were not inducted?

# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>% 
  group_by(playerID) %>% 
  tally()

head(nAwards)

## # A tibble: 6 x 2
##    playerID     n
##       <chr> <int>
## 1 aaronha01    16
## 2 abbotji01     2
## 3 abernte02     2
## 4 abreubo01     2
## 5 abreujo02     2
## 6 adamsba01     2

nAwards %>% 
  # Filter to just the players in inducted 
  semi_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = T))

## # A tibble: 1 x 1
##      avg_n
##      <dbl>
## 1 12.10582

nAwards %>% 
  # Filter to just the players in nominated 
  semi_join(nominated, by = "playerID") %>% 
  # Filter to players NOT in inducted 
  anti_join(inducted, by = "playerID") %>%
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = T))

## # A tibble: 1 x 1
##     avg_n
##     <dbl>
## 1 4.18985

Salary

Does the maximum salary earned by inductees tend to be greater than the maximum salary earned by nominees who were not inducted?

# Find the players who are in nominated, but not inducted
notInducted <- nominated %>% 
  setdiff(inducted, by = "playerID")

Salaries %>% 
  # Find the players who are in notInducted
  semi_join(notInducted) %>%
  # Calculate the max salary by player
  group_by(playerID) %>% 
  summarize(max_salary = max(salary)) %>% 
  # Calculate the average of the max salaries
  summarize(avg_salary = mean(max_salary))

## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1    4876812

# Repeat for players who were inducted
Salaries %>% 
  semi_join(inducted) %>%
  # Calculate the max salary by player
  group_by(playerID) %>% 
  summarize(max_salary = max(salary)) %>% 
  # Calculate the average of the max salaries
  summarize(avg_salary = mean(max_salary))

## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1    5673190

Retirement

One of the rules of the Hall of Fame is that players cannot be nominated until five years after they retire. Is this reflected in our data?

head(Appearances)

##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b
## 1   1871    BS1   NA barnero01    31 NA        31        31   0   0    0
## 2   1871    BS1   NA barrofr01    18 NA        18        18   0   0    0
## 3   1871    BS1   NA birdsda01    29 NA        29        29   0   7    0
## 4   1871    BS1   NA  conefr01    19 NA        19        19   0   0    0
## 5   1871    BS1   NA gouldch01    31 NA        31        31   0   0   30
## 6   1871    BS1   NA jackssa01    16 NA        16        16   0   0    0
##   G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1   16    0   15    0    0    0    0   NA   NA   NA
## 2    1    0    0   13    0    4   17   NA   NA   NA
## 3    0    0    0    0    0   27   27   NA   NA   NA
## 4    0    0    0   18    0    1   18   NA   NA   NA
## 5    0    0    0    0    0    1    1   NA   NA   NA
## 6   14    0    1    0    1    0    1   NA   NA   NA

Appearances %>% 
  # Filter Appearances against nominated
  semi_join(nominated, by = "playerID") %>%
  # Find last year played by player
  group_by(playerID) %>% 
  summarize(last_year = max(yearID)) %>%
  # Join to full HallOfFame
  left_join(HallOfFame, by = "playerID") %>% 
  # Filter for unusual observations
  filter(yearID <= last_year)

## # A tibble: 39 x 10
##     playerID last_year yearID         votedBy ballots needed votes
##        <chr>     <dbl>  <int>           <chr>   <int>  <int> <int>
##  1 cissebi01      1938   1937           BBWAA     201    151     1
##  2 cochrmi01      1937   1936           BBWAA     226    170    80
##  3  deandi01      1947   1945           BBWAA     247    186    17
##  4  deandi01      1947   1946    Final Ballot     263    198    45
##  5  deandi01      1947   1946 Nominating Vote     202     NA    40
##  6  deandi01      1947   1947           BBWAA     161    121    88
##  7 dickebi01      1946   1945           BBWAA     247    186    17
##  8 dickebi01      1946   1946 Nominating Vote     202     NA    40
##  9 dickebi01      1946   1946    Final Ballot     263    198    32
## 10 dimagjo01      1951   1945           BBWAA     247    186     1
## # ... with 29 more rows, and 3 more variables: inducted <fctr>,
## #   category <fctr>, needed_note <chr>

Joining Data in R with dplyr

William Surles

2017-07-20

Introduction

Whats Covered

Aditional Resources

Mutating Joins

Welcome to the course!

Keys

Primary Keys

Secondary Keys

Multi-variable keys

Joins

A basic join

A second join

A right join

Variations on joins

inner joins and full joins

Pipes

Practice with pipes and joins

Choose your joins

Filtering joins and set operations

Semi-joins

Apply a semi-join

Exploring with semi-joins

A more precise way to filter?

Anti-joins

Apply an anti-join

Apply another anti-join

Which filtering join?

Set operations

How many songs are there?

Greatest hits

Live! Bootleg songs

Multiple operations

Comparing datasets

Apply setequal

Apply setequal again

Comparing albums

Assembling data

Binds

Which bind?

Bind rows

Bind columns

Danger

Build a better data frame

Make a data frame

Lists of columns

List of rows (data frames)

Working with data types

Atomic data types

dplyr’s coercion rules

dplyr and coercion

Determining type

Results

Advanced Joining

What can go wrong?

Spot the key

Non-unique keys

Two non-unique keys

Missing keys

Defining the keys

A subset of keys

Mis-matched key names

More mismatched names

Joining multiple tables

Join multiple tables

Filter multiple tables

Other implementations

SQL

Base R

Case Study

Lahman’s Baseball Database

Universal keys?

Common keys?

playerID

Salaries

Who are the players?

Missing salaries

Unpaid games?