Introduction

Whats Covered

  • Mutating joins
  • Filtering joins and set operations
  • Assembling data
  • Advanced joining
  • Case Study

Mutating Joins


Welcome to the course!

  • Advantages to dplyr over base R merge function for joing data
    • dplyr always preserves the row order
    • dplyr has much more intuitive syntax
    • dplyr can be applied to databases, or spark
  • dplyr is a front end language for maniulating data that can be converted to multiple backends like SQL or spark.
    • This is pretty useful. Everything we cover in this class could work on a spark dataframe or a sql table

Keys

  • The Primary key needs to be unique in a table
  • The foreign key in the second table can be duplicated
  • second table will be matched to the primary table based on the primary key
  • The primary key may be one, two or even more columns in the table

Primary Keys

  • What is the primary key in artists dataset?
    • name
head(artists0, 10)
## # A tibble: 10 x 2
##               name instrument
##              <chr>      <chr>
##  1   Jimmy Buffett     Guitar
##  2 George Harrison     Guitar
##  3     Mick Jagger     Vocals
##  4       Tom Jones     Vocals
##  5      Davy Jones     Vocals
##  6     John Lennon     Guitar
##  7  Paul McCartney       Bass
##  8      Jimmy Page     Guitar
##  9       Joe Perry     Guitar
## 10   Elvis Presley     Vocals

Secondary Keys

  • Which datasets have a primary key that matches artists$name?
    • bands$name
    • songs$writer
head(albums0)
## # A tibble: 6 x 3
##                        name               band  year
##                       <chr>              <chr> <int>
## 1        A Hard Day's Night        The Beatles  1964
## 2      Magical Mystery Tour        The Beatles  1967
## 3          Beggar's Banquet The Rolling Stones  1968
## 4                Abbey Road        The Beatles  1969
## 5           Led Zeppelin IV       Led Zeppelin  1971
## 6 The Dark Side of the Moon         Pink Floyd  1973
head(bands0)
## # A tibble: 6 x 2
##              name         band
##             <chr>        <chr>
## 1     John Bonham Led Zeppelin
## 2 John Paul Jones Led Zeppelin
## 3      Jimmy Page Led Zeppelin
## 4    Robert Plant Led Zeppelin
## 5 George Harrison  The Beatles
## 6     John Lennon  The Beatles
head(songs0)
## # A tibble: 4 x 3
##               song                album         writer
##              <chr>                <chr>          <chr>
## 1    Come Together           Abbey Road    John Lennon
## 2         Dream On            Aerosmith   Steven Tyler
## 3   Hello, Goodbye Magical Mystery Tour Paul McCartney
## 4 It's Not Unusual     Along Came Jones      Tom Jones

Multi-variable keys

  • What is the primary key in artists?
    • The combination of first and last
head(artists)
## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar

Joins

  • Joins in dplyr will work on data frames, tibbles (tbl_df), and tbl references

A basic join

head(artists)
## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar
head(bands)
## # A tibble: 6 x 3
##       first     last         band
##       <chr>    <chr>        <chr>
## 1      John   Bonham Led Zeppelin
## 2 John Paul    Jones Led Zeppelin
## 3     Jimmy     Page Led Zeppelin
## 4    Robert    Plant Led Zeppelin
## 5    George Harrison  The Beatles
## 6      John   Lennon  The Beatles
# Complete the code to join artists to bands
bands2 <- left_join(bands, artists, by = c('first','last'))

# Examine the results
bands2
## # A tibble: 13 x 4
##        first      last               band instrument
##        <chr>     <chr>              <chr>      <chr>
##  1      John    Bonham       Led Zeppelin       <NA>
##  2 John Paul     Jones       Led Zeppelin       <NA>
##  3     Jimmy      Page       Led Zeppelin     Guitar
##  4    Robert     Plant       Led Zeppelin       <NA>
##  5    George  Harrison        The Beatles     Guitar
##  6      John    Lennon        The Beatles     Guitar
##  7      Paul McCartney        The Beatles       Bass
##  8     Ringo     Starr        The Beatles      Drums
##  9     Jimmy   Buffett  The Coral Reefers     Guitar
## 10      Mick    Jagger The Rolling Stones     Vocals
## 11     Keith  Richards The Rolling Stones     Guitar
## 12   Charlie     Watts The Rolling Stones       <NA>
## 13    Ronnie      Wood The Rolling Stones       <NA>

A second join

  • Notice how the last column is pulled in as last.x and last.y if not included in the join
    • And how all combination of rows are created for each name that is duplicated, e.g. john
left_join(bands, artists, by = "first")
## # A tibble: 16 x 5
##        first    last.x               band    last.y instrument
##        <chr>     <chr>              <chr>     <chr>      <chr>
##  1      John    Bonham       Led Zeppelin    Lennon     Guitar
##  2 John Paul     Jones       Led Zeppelin      <NA>       <NA>
##  3     Jimmy      Page       Led Zeppelin   Buffett     Guitar
##  4     Jimmy      Page       Led Zeppelin      Page     Guitar
##  5    Robert     Plant       Led Zeppelin      <NA>       <NA>
##  6    George  Harrison        The Beatles  Harrison     Guitar
##  7      John    Lennon        The Beatles    Lennon     Guitar
##  8      Paul McCartney        The Beatles McCartney       Bass
##  9      Paul McCartney        The Beatles     Simon     Guitar
## 10     Ringo     Starr        The Beatles     Starr      Drums
## 11     Jimmy   Buffett  The Coral Reefers   Buffett     Guitar
## 12     Jimmy   Buffett  The Coral Reefers      Page     Guitar
## 13      Mick    Jagger The Rolling Stones    Jagger     Vocals
## 14     Keith  Richards The Rolling Stones  Richards     Guitar
## 15   Charlie     Watts The Rolling Stones      <NA>       <NA>
## 16    Ronnie      Wood The Rolling Stones      <NA>       <NA>
# Fix the code to recreate bands3
left_join(bands, artists, by = c("first","last"))
## # A tibble: 13 x 4
##        first      last               band instrument
##        <chr>     <chr>              <chr>      <chr>
##  1      John    Bonham       Led Zeppelin       <NA>
##  2 John Paul     Jones       Led Zeppelin       <NA>
##  3     Jimmy      Page       Led Zeppelin     Guitar
##  4    Robert     Plant       Led Zeppelin       <NA>
##  5    George  Harrison        The Beatles     Guitar
##  6      John    Lennon        The Beatles     Guitar
##  7      Paul McCartney        The Beatles       Bass
##  8     Ringo     Starr        The Beatles      Drums
##  9     Jimmy   Buffett  The Coral Reefers     Guitar
## 10      Mick    Jagger The Rolling Stones     Vocals
## 11     Keith  Richards The Rolling Stones     Guitar
## 12   Charlie     Watts The Rolling Stones       <NA>
## 13    Ronnie      Wood The Rolling Stones       <NA>

A right join

# Finish the code below to recreate bands3 with a right join
bands2 <- left_join(bands, artists, by = c("first", "last"))

bands3 <- right_join(artists, bands, by = c("first", "last"))

# Check that bands2 is equal to bands3
head(bands2)
## # A tibble: 6 x 4
##       first     last         band instrument
##       <chr>    <chr>        <chr>      <chr>
## 1      John   Bonham Led Zeppelin       <NA>
## 2 John Paul    Jones Led Zeppelin       <NA>
## 3     Jimmy     Page Led Zeppelin     Guitar
## 4    Robert    Plant Led Zeppelin       <NA>
## 5    George Harrison  The Beatles     Guitar
## 6      John   Lennon  The Beatles     Guitar
head(bands3)
## # A tibble: 6 x 4
##       first     last instrument         band
##       <chr>    <chr>      <chr>        <chr>
## 1      John   Bonham       <NA> Led Zeppelin
## 2 John Paul    Jones       <NA> Led Zeppelin
## 3     Jimmy     Page     Guitar Led Zeppelin
## 4    Robert    Plant       <NA> Led Zeppelin
## 5    George Harrison     Guitar  The Beatles
## 6      John   Lennon     Guitar  The Beatles
setequal(bands2, bands3)
## TRUE

Variations on joins

  • left_join - prioritizes left dataset
  • right_join - prioritizes right dataset
  • inner_join - only retains rows in both datasets
  • full_join - retains all rows
  • Use %>% (pipes) to string together these joins

inner joins and full joins

head(songs)
## # A tibble: 4 x 4
##               song                album  first      last
##              <chr>                <chr>  <chr>     <chr>
## 1    Come Together           Abbey Road   John    Lennon
## 2         Dream On            Aerosmith Steven     Tyler
## 3   Hello, Goodbye Magical Mystery Tour   Paul McCartney
## 4 It's Not Unusual     Along Came Jones    Tom     Jones
head(albums)
## # A tibble: 6 x 3
##                       album               band  year
##                       <chr>              <chr> <int>
## 1        A Hard Day's Night        The Beatles  1964
## 2      Magical Mystery Tour        The Beatles  1967
## 3          Beggar's Banquet The Rolling Stones  1968
## 4                Abbey Road        The Beatles  1969
## 5           Led Zeppelin IV       Led Zeppelin  1971
## 6 The Dark Side of the Moon         Pink Floyd  1973
inner_join(songs, albums, by = "album")
## # A tibble: 3 x 6
##             song                album  first      last        band  year
##            <chr>                <chr>  <chr>     <chr>       <chr> <int>
## 1  Come Together           Abbey Road   John    Lennon The Beatles  1969
## 2       Dream On            Aerosmith Steven     Tyler   Aerosmith  1973
## 3 Hello, Goodbye Magical Mystery Tour   Paul McCartney The Beatles  1967
# Join bands to artists using full_join()
head(artists)
## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar
head(bands)
## # A tibble: 6 x 3
##       first     last         band
##       <chr>    <chr>        <chr>
## 1      John   Bonham Led Zeppelin
## 2 John Paul    Jones Led Zeppelin
## 3     Jimmy     Page Led Zeppelin
## 4    Robert    Plant Led Zeppelin
## 5    George Harrison  The Beatles
## 6      John   Lennon  The Beatles
full_join(artists, bands, by = c("first","last"))
## # A tibble: 21 x 4
##     first      last instrument               band
##     <chr>     <chr>      <chr>              <chr>
##  1  Jimmy   Buffett     Guitar  The Coral Reefers
##  2 George  Harrison     Guitar        The Beatles
##  3   Mick    Jagger     Vocals The Rolling Stones
##  4    Tom     Jones     Vocals               <NA>
##  5   Davy     Jones     Vocals               <NA>
##  6   John    Lennon     Guitar        The Beatles
##  7   Paul McCartney       Bass        The Beatles
##  8  Jimmy      Page     Guitar       Led Zeppelin
##  9    Joe     Perry     Guitar               <NA>
## 10  Elvis   Presley     Vocals               <NA>
## # ... with 11 more rows

Pipes

# Find guitarists in bands dataset (don't change)
temp <- left_join(bands, artists, by = c("first", "last"))
temp <- filter(temp, instrument == "Guitar")
select(temp, first, last, band)
## # A tibble: 5 x 3
##    first     last               band
##    <chr>    <chr>              <chr>
## 1  Jimmy     Page       Led Zeppelin
## 2 George Harrison        The Beatles
## 3   John   Lennon        The Beatles
## 4  Jimmy  Buffett  The Coral Reefers
## 5  Keith Richards The Rolling Stones
# Reproduce code above using pipes
bands %>% 
  left_join(artists, by = c("first","last")) %>%
  filter(instrument == "Guitar") %>%
  select(first, last, band)
## # A tibble: 5 x 3
##    first     last               band
##    <chr>    <chr>              <chr>
## 1  Jimmy     Page       Led Zeppelin
## 2 George Harrison        The Beatles
## 3   John   Lennon        The Beatles
## 4  Jimmy  Buffett  The Coral Reefers
## 5  Keith Richards The Rolling Stones

Practice with pipes and joins

# Examine the contents of the goal dataset
goal
## # A tibble: 3 x 6
##   first      last instrument        band             song
##   <chr>     <chr>      <chr>       <chr>            <chr>
## 1   Tom     Jones     Vocals        <NA> It's Not Unusual
## 2  John    Lennon     Guitar The Beatles    Come Together
## 3  Paul McCartney       Bass The Beatles   Hello, Goodbye
## # ... with 1 more variables: album <chr>
# Create goal2 using full_join() and inner_join() 
head(artists)
## # A tibble: 6 x 3
##    first     last instrument
##    <chr>    <chr>      <chr>
## 1  Jimmy  Buffett     Guitar
## 2 George Harrison     Guitar
## 3   Mick   Jagger     Vocals
## 4    Tom    Jones     Vocals
## 5   Davy    Jones     Vocals
## 6   John   Lennon     Guitar
head(bands)
## # A tibble: 6 x 3
##       first     last         band
##       <chr>    <chr>        <chr>
## 1      John   Bonham Led Zeppelin
## 2 John Paul    Jones Led Zeppelin
## 3     Jimmy     Page Led Zeppelin
## 4    Robert    Plant Led Zeppelin
## 5    George Harrison  The Beatles
## 6      John   Lennon  The Beatles
head(songs)
## # A tibble: 4 x 4
##               song                album  first      last
##              <chr>                <chr>  <chr>     <chr>
## 1    Come Together           Abbey Road   John    Lennon
## 2         Dream On            Aerosmith Steven     Tyler
## 3   Hello, Goodbye Magical Mystery Tour   Paul McCartney
## 4 It's Not Unusual     Along Came Jones    Tom     Jones
goal2 <- artists %>%
  full_join(bands, by = c("first","last")) %>%
  inner_join(songs, by = c("first","last"))
  
# Check that goal and goal2 are the same
setequal(goal, goal2)
## TRUE

Choose your joins

# Create one table that combines all information
collection <- artists %>%
  full_join(bands, by = c("first","last")) %>%
  full_join(songs, by = c("first","last")) %>%
  full_join(albums, by = c("album", "band"))
  
collection
## # A tibble: 29 x 7
##     first      last instrument               band             song
##     <chr>     <chr>      <chr>              <chr>            <chr>
##  1  Jimmy   Buffett     Guitar  The Coral Reefers             <NA>
##  2 George  Harrison     Guitar        The Beatles             <NA>
##  3   Mick    Jagger     Vocals The Rolling Stones             <NA>
##  4    Tom     Jones     Vocals               <NA> It's Not Unusual
##  5   Davy     Jones     Vocals               <NA>             <NA>
##  6   John    Lennon     Guitar        The Beatles    Come Together
##  7   Paul McCartney       Bass        The Beatles   Hello, Goodbye
##  8  Jimmy      Page     Guitar       Led Zeppelin             <NA>
##  9    Joe     Perry     Guitar               <NA>             <NA>
## 10  Elvis   Presley     Vocals               <NA>             <NA>
## # ... with 19 more rows, and 2 more variables: album <chr>, year <int>

   


Filtering joins and set operations


  • Filtering joins return a copy of the dataset that has been filtered, not augmented (as with mutating joins)

Semi-joins

Apply a semi-join

# View the output of semi_join()
artists %>% 
  semi_join(songs, by = c("first", "last"))
## # A tibble: 3 x 3
##   first      last instrument
##   <chr>     <chr>      <chr>
## 1  John    Lennon     Guitar
## 2  Paul McCartney       Bass
## 3   Tom     Jones     Vocals
# Create the same result
artists %>% 
  right_join(songs, by = c("first", "last")) %>% 
  filter(!is.na(instrument)) %>% 
  select(first, last, instrument)
## # A tibble: 3 x 3
##   first      last instrument
##   <chr>     <chr>      <chr>
## 1  John    Lennon     Guitar
## 2  Paul McCartney       Bass
## 3   Tom     Jones     Vocals

Exploring with semi-joins

albums %>% 
  # Collect the albums made by a band
  semi_join(bands, by = "band") %>% 
  # Count the albums made by a band
  nrow()
## [1] 5

A more precise way to filter?

  • You can use semi_joins with a df of date you want to match to easily filter
tracks
## # A tibble: 21 x 6
##                               track                band      label  year
##                               <chr>               <chr>      <chr> <int>
##  1                Can't Buy Me Love         The Beatles Parlophone  1964
##  2                      I Feel Fine         The Beatles Parlophone  1964
##  3               A Hard Day's Night         The Beatles Parlophone  1964
##  4                 Sound of Silence Simon and Garfunkel   Columbia  1964
##  5                            Help!         The Beatles Parlophone  1965
##  6                   Ticket to Ride         The Beatles Parlophone  1965
##  7                      I am a Rock Simon and Garfunkel   Columbia  1965
##  8 Yellow Submarine / Eleanor Rigby         The Beatles Parlophone  1966
##  9                   Homeward Bound Simon and Garfunkel   Columbia  1966
## 10                 Scarborough Fair Simon and Garfunkel   Columbia  1966
## # ... with 11 more rows, and 2 more variables: first <chr>, last <chr>
matches
## # A tibble: 3 x 3
##                  band  year first
##                 <chr> <int> <chr>
## 1         The Beatles  1964  Paul
## 2         The Beatles  1965  John
## 3 Simon and Garfunkel  1966  Paul
# With semi-join
tracks %>% 
  semi_join(matches,by = c("band", "year", "first"))
## # A tibble: 4 x 6
##               track                band      label  year first      last
##               <chr>               <chr>      <chr> <int> <chr>     <chr>
## 1 Can't Buy Me Love         The Beatles Parlophone  1964  Paul McCartney
## 2             Help!         The Beatles Parlophone  1965  John    Lennon
## 3    Ticket to Ride         The Beatles Parlophone  1965  John    Lennon
## 4    Homeward Bound Simon and Garfunkel   Columbia  1966  Paul     Simon
# With dply filter statement
tracks %>% 
  filter(
    (band == "The Beatles" & year == 1964 & first == "Paul") |
    (band == "The Beatles" & year == 1965 & first == "John") |
    (band == "Simon and Garfunkel" & year == 1966 & first == "Paul")
  )
## # A tibble: 4 x 6
##               track                band      label  year first      last
##               <chr>               <chr>      <chr> <int> <chr>     <chr>
## 1 Can't Buy Me Love         The Beatles Parlophone  1964  Paul McCartney
## 2             Help!         The Beatles Parlophone  1965  John    Lennon
## 3    Ticket to Ride         The Beatles Parlophone  1965  John    Lennon
## 4    Homeward Bound Simon and Garfunkel   Columbia  1966  Paul     Simon

Anti-joins

Apply an anti-join

# Return rows of artists that don't have bands info
artists %>% 
  anti_join(bands, by = c("first","last"))
## # A tibble: 8 x 3
##   first    last instrument
##   <chr>   <chr>      <chr>
## 1 Nancy  Wilson     Vocals
## 2 Brian  Wilson     Vocals
## 3   Joe   Walsh     Guitar
## 4  Paul   Simon     Guitar
## 5 Elvis Presley     Vocals
## 6   Joe   Perry     Guitar
## 7  Davy   Jones     Vocals
## 8   Tom   Jones     Vocals

Apply another anti-join

# Check whether album names in labels are mis-entered
glimpse(labels)
## Observations: 9
## Variables: 2
## $ album <chr> "Abbey Road", "A Hard Days Night", "Magical Mystery Tour...
## $ label <chr> "Apple", "Parlophone", "Parlophone", "Atlantic", "Harves...
glimpse(albums)
## Observations: 9
## Variables: 3
## $ album <chr> "A Hard Day's Night", "Magical Mystery Tour", "Beggar's ...
## $ band  <chr> "The Beatles", "The Beatles", "The Rolling Stones", "The...
## $ year  <int> 1964, 1967, 1968, 1969, 1971, 1973, 1973, 1977, 1982
labels %>% 
  anti_join(albums, by = c("album"))
## # A tibble: 1 x 2
##               album      label
##               <chr>      <chr>
## 1 A Hard Days Night Parlophone

Which filtering join?

# Determine which key joins labels and songs
labels
## # A tibble: 9 x 2
##                       album           label
##                       <chr>           <chr>
## 1                Abbey Road           Apple
## 2         A Hard Days Night      Parlophone
## 3      Magical Mystery Tour      Parlophone
## 4           Led Zeppelin IV        Atlantic
## 5 The Dark Side of the Moon         Harvest
## 6          Hotel California          Asylum
## 7                   Rumours Warner Brothers
## 8                 Aerosmith        Columbia
## 9          Beggar's Banquet           Decca
songs
## # A tibble: 4 x 4
##               song                album  first      last
##              <chr>                <chr>  <chr>     <chr>
## 1    Come Together           Abbey Road   John    Lennon
## 2         Dream On            Aerosmith Steven     Tyler
## 3   Hello, Goodbye Magical Mystery Tour   Paul McCartney
## 4 It's Not Unusual     Along Came Jones    Tom     Jones
# Check your understanding
songs %>% 
  # Find the rows of songs that match a row in labels
  semi_join(labels, by = c("album")) %>% 
  # Number of matches between labels and songs
  nrow()
## [1] 3

Set operations

  • union will return every row that appears in one or more of the datasets
    • If a row appears multiple times union will only return it once
  • interesect will return only the rows that appear in both datasets
  • setdiff will return the rows that appear in the first dataset but not the second

How many songs are there?

  • How many unique songs do these two albums contain in total?
head(aerosmith)
## # A tibble: 6 x 2
##                song     length
##               <chr>     <time>
## 1           Make It 13260 secs
## 2          Somebody 13500 secs
## 3          Dream On 16080 secs
## 4    One Way Street 25200 secs
## 5          Mama Kin 15900 secs
## 6 Write me a Letter 15060 secs
head(greatest_hits)
## # A tibble: 6 x 2
##                      song     length
##                     <chr>     <time>
## 1                Dream On 16080 secs
## 2                Mama Kin 16020 secs
## 3 Same Old Song and Dance 11040 secs
## 4       Seasons of Winter 17820 secs
## 5           Sweet Emotion 11700 secs
## 6           Walk this Way 12780 secs
aerosmith %>% 
  # Create the new dataset using a set operation
  union(greatest_hits) %>% 
  # Count the total number of songs
  nrow()
## [1] 24

Greatest hits

  • Which songs from Aerosmith made it onto Greates Hits?
# Create the new dataset using a set operation
aerosmith %>% 
  intersect(greatest_hits)
## # A tibble: 1 x 2
##       song     length
##      <chr>     <time>
## 1 Dream On 16080 secs

Live! Bootleg songs

  • Which songs are on Live! Bootleg but not on Greatest Hits? Notice that the length of songs may be different when they are performed live.
# Select the song names from live
live_songs <- live %>% 
  select(song)

# Select the song names from greatest_hits
greatest_songs <- greatest_hits %>% 
  select(song)

# Create the new dataset using a set operation
live_songs %>% 
  setdiff(greatest_songs)
## # A tibble: 8 x 1
##                                          song
##                                         <chr>
## 1                          Lord of the Thighs
## 2                           Toys in the Attic
## 3                               Sick as a Dog
## 4                         Sight for Sore Eyes
## 5                            S.O.S. (Too Bad)
## 6                             I Ain't Got You
## 7                Mother Popcorn/Draw the Line
## 8 Train Kept A-Rollin'/Strangers in the Night

Multiple operations

  • Which songs appear on one of Live! Bootleg or Greatest Hits, but not both?
# Select songs from live and greatest_hits
live_songs <- live %>% select(song)
greatest_songs <- greatest_hits %>% select(song)

# Return the songs that only exist in one dataset
all_songs <- union(live_songs, greatest_songs)
both_songs <- intersect(live_songs, greatest_songs)
one_songs <- setdiff(all_songs, both_songs)
one_songs
## # A tibble: 17 x 1
##                                           song
##                                          <chr>
##  1                       One Way Street (live)
##  2                       Sweet Emotion (remix)
##  3                           Lightning Strikes
##  4              Remember (Walking in the Sand)
##  5                            Kings and Queens
##  6                               Draw the Line
##  7                         Big Ten Inch Record
##  8                           Seasons of Winter
##  9                     Same Old Song and Dance
## 10 Train Kept A-Rollin'/Strangers in the Night
## 11                Mother Popcorn/Draw the Line
## 12                             I Ain't Got You
## 13                            S.O.S. (Too Bad)
## 14                         Sight for Sore Eyes
## 15                               Sick as a Dog
## 16                           Toys in the Attic
## 17                          Lord of the Thighs

Comparing datasets

  • Its common to want to know if one data set is the same as another dataset
  • dplyr’s setequal will do this easily
  • base R’s identical is will only return true if the datasets have the exact same rows in the exact same order

Recap:

  • Mutating Joins:
    • left_join
    • right_join
    • inner_join
    • full_join
  • Filtering Joins:
    • semi_join
    • anti_join
  • Set Operations:
    • union
    • intersect
    • setdiff
  • Comparisions
    • setequal

Apply setequal

  • Both anthologies claim to contain the complete studio recordings of Led Zeppelin, but do the anthologies contain the same exact songs?
# Check if same order: definitive and complete
identical(definitive, complete)
## [1] FALSE
# Check if any order: definitive and complete
setequal(definitive, complete)
## FALSE: Different number of rows
# Songs in definitive but not complete
setdiff(definitive, complete)
## # A tibble: 15 x 2
##                           song                     album
##                          <chr>                     <chr>
##  1               Rock and Roll The Song Remains the Same
##  2             Celebration Day The Song Remains the Same
##  3                   Black Dog The Song Remains the Same
##  4 Over the Hills and Far Away The Song Remains the Same
##  5          Misty Mountain Hop The Song Remains the Same
##  6  Since I've Been Loving You The Song Remains the Same
##  7                  No Quarter The Song Remains the Same
##  8   The Song Remains the Same The Song Remains the Same
##  9               The Rain Song The Song Remains the Same
## 10                   The Ocean The Song Remains the Same
## 11          Dazed and Confused The Song Remains the Same
## 12          Stairway to Heaven The Song Remains the Same
## 13                   Moby Dick The Song Remains the Same
## 14                Heartbreaker The Song Remains the Same
## 15            Whole Lotta Love The Song Remains the Same
# Songs in complete but not definitive
setdiff(complete, definitive)
## # A tibble: 0 x 2
## # ... with 2 variables: song <chr>, album <chr>

Apply setequal again

  • If the datasets have the same columns and all are used in the key, then intersect is analagous to semi_join
  • In the same scenario setdiff is similar to anti_join
# Return songs in definitive that are not in complete
definitive %>% 
  anti_join(complete, by = c("song", "album"))
## # A tibble: 15 x 2
##                           song                     album
##                          <chr>                     <chr>
##  1            Whole Lotta Love The Song Remains the Same
##  2                Heartbreaker The Song Remains the Same
##  3                   Moby Dick The Song Remains the Same
##  4          Stairway to Heaven The Song Remains the Same
##  5          Dazed and Confused The Song Remains the Same
##  6                   The Ocean The Song Remains the Same
##  7               The Rain Song The Song Remains the Same
##  8   The Song Remains the Same The Song Remains the Same
##  9                  No Quarter The Song Remains the Same
## 10  Since I've Been Loving You The Song Remains the Same
## 11          Misty Mountain Hop The Song Remains the Same
## 12 Over the Hills and Far Away The Song Remains the Same
## 13                   Black Dog The Song Remains the Same
## 14             Celebration Day The Song Remains the Same
## 15               Rock and Roll The Song Remains the Same
# Return songs in complete that are not in definitive
complete %>% 
  anti_join(definitive, by = c("song", "album"))
## # A tibble: 0 x 2
## # ... with 2 variables: song <chr>, album <chr>

Comparing albums

  • Is the soundtrack of The Song Remains the Same the only difference between The Definitive Collection and The Complete Studio Recordings?
nrow(complete)
## [1] 85
nrow(definitive)
## [1] 100
# Check if same order: definitive and union of complete and soundtrack
union(complete, soundtrack) %>%
  identical(definitive)
## [1] FALSE
# Check if any order: definitive and union of complete and soundtrack
union(complete, soundtrack) %>%
  setequal(definitive)
## TRUE

   


Assembling data


Binds

  • Base R binds
    • rbind, cbind
  • dplyr binds
    • bind_rows, bind_cols

 

  • Advantages of dplyr versions
    • faster
    • return a tibble (advanced data frame)
    • can handle lists of dataframes
    • .id argument in bind_rows allows you to pass in a name for each source dataframe
    • this will be added as a column to the result dataset
    • this is really handy when one of the variables for the data set is stored in the table name.
    • e.g. houly weather data for different days
    • rbind will return an error if the column names do not match exactly.
    • bind_rows will create a column for each unique column and distribute NAs appropriately

Which bind?

# Examine side_one and side_two
side_one
## # A tibble: 5 x 2
##                       song     length
##                      <chr>     <time>
## 1              Speak to Me  5400 secs
## 2                  Breathe  9780 secs
## 3               On the Run 12600 secs
## 4                     Time 24780 secs
## 5 The Great Gig in the Sky 15300 secs
side_two
## # A tibble: 5 x 2
##                  song     length
##                 <chr>     <time>
## 1               Money 23400 secs
## 2         Us and Them 28260 secs
## 3 Any Colour You Like 12240 secs
## 4        Brain Damage 13800 secs
## 5             Eclipse  7380 secs
# Bind side_one and side_two into a single dataset
side_one %>% 
  bind_rows(side_two)
## # A tibble: 10 x 2
##                        song     length
##                       <chr>     <time>
##  1              Speak to Me  5400 secs
##  2                  Breathe  9780 secs
##  3               On the Run 12600 secs
##  4                     Time 24780 secs
##  5 The Great Gig in the Sky 15300 secs
##  6                    Money 23400 secs
##  7              Us and Them 28260 secs
##  8      Any Colour You Like 12240 secs
##  9             Brain Damage 13800 secs
## 10                  Eclipse  7380 secs

Bind rows

  • you can pass bind_rows() a list of data frames like jimi to bind together into a single data frame
  • you can save the name of the list element into a column with the .id argument
# Examine discography and jimi
discography
## # A tibble: 3 x 2
##                 album  year
##                 <chr> <int>
## 1 Are You Experienced  1967
## 2  Axis: Bold as Love  1967
## 3   Electric Ladyland  1968
jimi
## $`Are You Experienced`
## # A tibble: 10 x 2
##                        song     length
##                       <chr>     <time>
##  1              Purple Haze  9960 secs
##  2         Manic Depression 13560 secs
##  3                  Hey Joe 12180 secs
##  4         May This Be Love 11640 secs
##  5       I Don't Live Today 14100 secs
##  6      The Wind Cries Mary 12060 secs
##  7                     Fire  9240 secs
##  8 Third Stone from the Sun 24000 secs
##  9                Foxy Lady 11700 secs
## 10     Are You Experienced? 14100 secs
## 
## $`Axis: Bold As Love`
## # A tibble: 13 x 2
##                    song     length
##                   <chr>     <time>
##  1                  EXP  6900 secs
##  2    Up from the Skies 10500 secs
##  3 Spanish Castle Magic 10800 secs
##  4  Wait Until Tomorrow 10800 secs
##  5     Ain't No Telling  6360 secs
##  6          Little Wing  8640 secs
##  7           If 6 was 9 19920 secs
##  8   You Got Me Floatin  9900 secs
##  9 Castles Made of Sand  9960 secs
## 10        She's So Fine  9420 secs
## 11       One Rainy Wish 13200 secs
## 12    Little Miss Lover  8400 secs
## 13         Bold as Love 15060 secs
## 
## $`Electric Ladyland`
## # A tibble: 16 x 2
##                                          song     length
##                                         <chr>     <time>
##  1                     And the Gods Made Love  4860 secs
##  2  Have You Ever Been (To Electric Ladyland)  7860 secs
##  3                          Crosstown Traffic  8700 secs
##  4                               Voodoo Chile 54000 secs
##  5                        Little Miss Strange 10320 secs
##  6                      Long Hot Summer Night 12420 secs
##  7                           Come On (Part 1) 14940 secs
##  8                                 Gypsy Eyes 13380 secs
##  9               Burning of the Midnight Lamp 13140 secs
## 10                      Rainy Day, Dream Away 13320 secs
## 11     1983... (A Merman I Should Turn to Be) 49140 secs
## 12 Moon, Turn the Tides... Gently Gently Away  3720 secs
## 13              Still Raining, Still Dreaming 15900 secs
## 14                         House Burning Down 16380 secs
## 15                   All Along the Watchtower 14460 secs
## 16               Voodoo Child (Slight Return) 18720 secs
jimi %>% 
  # Bind jimi into a single data frame
  bind_rows(.id = "album") %>% 
  # Make a complete data frame
  left_join(discography, by = c("album"))
## # A tibble: 39 x 4
##                  album                     song     length  year
##                  <chr>                    <chr>     <time> <int>
##  1 Are You Experienced              Purple Haze  9960 secs  1967
##  2 Are You Experienced         Manic Depression 13560 secs  1967
##  3 Are You Experienced                  Hey Joe 12180 secs  1967
##  4 Are You Experienced         May This Be Love 11640 secs  1967
##  5 Are You Experienced       I Don't Live Today 14100 secs  1967
##  6 Are You Experienced      The Wind Cries Mary 12060 secs  1967
##  7 Are You Experienced                     Fire  9240 secs  1967
##  8 Are You Experienced Third Stone from the Sun 24000 secs  1967
##  9 Are You Experienced                Foxy Lady 11700 secs  1967
## 10 Are You Experienced     Are You Experienced? 14100 secs  1967
## # ... with 29 more rows

Bind columns

# Examine hank_years and hank_charts
hank_years
## # A tibble: 67 x 2
##     year                                    song
##    <int>                                   <chr>
##  1  1947                         Move It On Over
##  2  1947    My Love for You (Has Turned to Hate)
##  3  1947 Never Again (Will I Knock on Your Door)
##  4  1947    On the Banks of the Old Ponchartrain
##  5  1947                            Pan American
##  6  1947             Wealth Won't Save Your Soul
##  7  1948                   A Mansion on the Hill
##  8  1948                           Honky Tonkin'
##  9  1948                         I Saw the Light
## 10  1948                   I'm a Long Gone Daddy
## # ... with 57 more rows
hank_charts
## # A tibble: 67 x 2
##                               song  peak
##                              <chr> <int>
##  1 (I Heard That) Lonesome Whistle     9
##  2    (I'm Gonna) Sing, Sing, Sing    NA
##  3                A Home in Heaven    NA
##  4           A Mansion on the Hill    12
##  5            A Teardrop on a Rose    NA
##  6       At the First Fall of Snow    NA
##  7      Baby, We're Really in Love     4
##  8               California Zephyr    NA
##  9                     Calling You    NA
## 10                Cold, Cold Heart     1
## # ... with 57 more rows
hank_years %>% 
  # Reorder hank_years alphabetically by song title
  arrange(song) %>% 
  # Select just the year column
  select(year) %>% 
  # Bind the year column
  bind_cols(hank_charts) %>% 
  # Arrange the finished dataset
  arrange(year, song)
## # A tibble: 67 x 3
##     year                                    song  peak
##    <int>                                   <chr> <int>
##  1  1947                         Move It On Over     4
##  2  1947    My Love for You (Has Turned to Hate)    NA
##  3  1947 Never Again (Will I Knock on Your Door)    NA
##  4  1947    On the Banks of the Old Ponchartrain    NA
##  5  1947                            Pan American    NA
##  6  1947             Wealth Won't Save Your Soul    NA
##  7  1948                   A Mansion on the Hill    12
##  8  1948                           Honky Tonkin'    14
##  9  1948                         I Saw the Light    NA
## 10  1948                   I'm a Long Gone Daddy     6
## # ... with 57 more rows

Danger

  • The biggest risk when using bind_cols is that the observations may not align acros the rows of your data frame

Build a better data frame

  • R Base way to create a dataframe
    • data.frame() and as.data.frame()
  • dplyr way to create a dataframe
    • data_frame() and as_data_frame()
  • Advantages of using data_frame():
    • will never change the data type of a vector. (e.g. strings to factors)
    • will never add row names
    • will not changes unusual column names
    • only recycles length 1 inputs
    • evaluate lazily and in order. (So you can reference a column in the next column definition)
    • outputs a tibble (class tbl_df)
  • as_data_frame does all these same things
    • its just used on a list of columns

Make a data frame

head(hank_year)
## [1] 1947 1947 1947 1947 1947 1947
head(hank_song)
## [1] "Move It On Over"                        
## [2] "My Love for You (Has Turned to Hate)"   
## [3] "Never Again (Will I Knock on Your Door)"
## [4] "On the Banks of the Old Ponchartrain"   
## [5] "Pan American"                           
## [6] "Wealth Won't Save Your Soul"
head(hank_peak)
## [1]  4 NA NA NA NA NA
# Make combined data frame using data_frame()
data_frame(
  year = hank_year,
  song = hank_song,
  peak = hank_peak
  ) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1)
## # A tibble: 11 x 3
##     year                                   song  peak
##    <int>                                  <chr> <int>
##  1  1949                         Lovesick Blues     1
##  2  1950               Long Gone Lonesome Blues     1
##  3  1950                      Moanin' the Blues     1
##  4  1950                  Why Don't You Love Me     1
##  5  1951                       Cold, Cold Heart     1
##  6  1951                       Hey Good Lookin'     1
##  7  1952 I'll Never Get Out of This World Alive     1
##  8  1952               Jambalaya (On the Bayou)     1
##  9  1953                               Kaw-Liga     1
## 10  1953        Take These Chains from My Heart     1
## 11  1953                    Your Cheatin' Heart     1

Lists of columns

# Examine the contents of hank
str(hank)
## List of 3
##  $ year: int [1:67] 1947 1947 1947 1947 1947 1947 1948 1948 1948 1948 ...
##  $ song: chr [1:67] "Move It On Over" "My Love for You (Has Turned to Hate)" "Never Again (Will I Knock on Your Door)" "On the Banks of the Old Ponchartrain" ...
##  $ peak: int [1:67] 4 NA NA NA NA NA 12 14 NA 2 ...
# Convert the hank list into a data frame
as_data_frame(hank) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1)
## # A tibble: 11 x 3
##     year                                   song  peak
##    <int>                                  <chr> <int>
##  1  1949                         Lovesick Blues     1
##  2  1950               Long Gone Lonesome Blues     1
##  3  1950                      Moanin' the Blues     1
##  4  1950                  Why Don't You Love Me     1
##  5  1951                       Cold, Cold Heart     1
##  6  1951                       Hey Good Lookin'     1
##  7  1952 I'll Never Get Out of This World Alive     1
##  8  1952               Jambalaya (On the Bayou)     1
##  9  1953                               Kaw-Liga     1
## 10  1953        Take These Chains from My Heart     1
## 11  1953                    Your Cheatin' Heart     1

List of rows (data frames)

# Examine the contents of michael
str(michael, give.attr = F)
## List of 10
##  $ Got to Be There :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Ain'T No Sunshine" "I Wanna be Where You Are" "Girl Don't Take Your Love from Me" "In Our Small Way" ...
##   ..$ peak: int [1:10] NA NA NA NA 4 2 NA NA NA NA
##  $ Ben             :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Ben" "Greatest Show On Earth" "People Make the World Go 'Round" "We've Got a Good Thing" ...
##   ..$ peak: int [1:10] 1 NA NA NA NA NA NA NA NA NA
##  $ Music & Me      :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "With a Child's Heart" "Up Again" "All the Things You Are" "Happy" ...
##   ..$ peak: int [1:10] 50 NA NA NA NA NA NA NA NA NA
##  $ Forever, Michael:Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "We're Almost There" "Take Me Back" "One Day in Your Life" "Cinderella Stay Awhile" ...
##   ..$ peak: int [1:10] 54 NA NA NA NA 23 NA NA NA NA
##  $ Off the Wall    :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Don't Stop 'Til You Get Enough" "Rock with You" "Working Day and Night" "Get on the Floor" ...
##   ..$ peak: int [1:10] 1 1 NA NA 10 NA 10 NA NA NA
##  $ Thriller        :Classes 'tbl_df', 'tbl' and 'data.frame':    9 obs. of  2 variables:
##   ..$ song: chr [1:9] "Wanna Be Startin' Somethin" "Baby be Mine" "The Girl is Mine" "Thriller" ...
##   ..$ peak: int [1:9] 5 NA 2 4 1 1 7 10 NA
##  $ Bad             :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Bad" "The Way You Make Me Feel" "Speed Demon" "Liberian Girl" ...
##   ..$ peak: int [1:10] 1 1 NA NA NA 11 1 1 1 7
##  $ Dangerous       :Classes 'tbl_df', 'tbl' and 'data.frame':    14 obs. of  2 variables:
##   ..$ song: chr [1:14] "Jam" "Why You Wanna Trip on Me" "In the Closet" "She Drives Me Wild" ...
##   ..$ peak: int [1:14] 26 NA 6 NA 3 NA 27 1 14 NA ...
##  $ HIStory         :Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  2 variables:
##   ..$ song: chr [1:30] "Billie Jean" "The Way You Make Me Feel" "Black or White" "Rock with You" ...
##   ..$ peak: int [1:30] NA NA NA NA NA NA NA NA NA NA ...
##  $ Invincible      :Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  2 variables:
##   ..$ song: chr [1:16] "Unbreakable" "Heartbreaker" "Invincible" "Break of Dawn" ...
##   ..$ peak: int [1:16] NA NA NA NA NA 10 NA NA NA NA ...
bind_rows(michael, .id = "album") %>% 
  group_by(album) %>% 
  mutate(rank = min_rank(peak)) %>% 
  filter(rank == 1) %>% 
  select(-rank, -peak)
## # A tibble: 16 x 2
## # Groups:   album [10]
##               album                           song
##               <chr>                          <chr>
##  1  Got to Be There                  Rockin' Robin
##  2              Ben                            Ben
##  3       Music & Me           With a Child's Heart
##  4 Forever, Michael       Just a Little Bit of You
##  5     Off the Wall Don't Stop 'Til You Get Enough
##  6     Off the Wall                  Rock with You
##  7         Thriller                        Beat It
##  8         Thriller                    Billie Jean
##  9              Bad                            Bad
## 10              Bad       The Way You Make Me Feel
## 11              Bad              Man in the Mirror
## 12              Bad   I Just Can't Stop Loving You
## 13              Bad                    Dirty Diana
## 14        Dangerous                 Black or White
## 15          HIStory              You Are Not Alone
## 16       Invincible              You Rock My World

Working with data types

  • Be aware of data types when combining vectors or data frames
    • R will need to make a decision about the resulting type if the original types differ for a vector/column
  • 6 atomic data types in R
    • logical
    • character
    • double
    • integer
    • complex
    • raw
typeof(TRUE)
## [1] "logical"
typeof("hello")
## [1] "character"
typeof(3.14)
## [1] "double"
typeof(1L)
## [1] "integer"
typeof(1 + 2i)
## [1] "complex"
typeof(raw(1))
## [1] "raw"

Atomic data types

  • R will coerce logicals and numerics to strings if strings are present
  • R will coerce logicals to numerics if numerics are present
  • as.character() will coerce factors to their labels as strings if strings are present
  • as.numeric() will coerce factors to their levels as numerics if numerics are present

dplyr’s coercion rules

Base R coercion rules:

  • in general the more specific type of data will convert to the more general
    • integer and logical will go to integer
    • double and integer/locgical will go to double
    • character and integer/double/logical will go to character
  • factors are tricky
    • if joining to an integer or double the factor integer values will be used
    • if joining to a character the factor labels will be used
    • be careful if you have numeric labels for your factors

dplyr’s coercion rules:

  • dplyr functions will not automatically coerce data types
    • returns and error
    • expects you to manyally coerce data
  • Exception: factors
    • dplyr coverts non-aligning factors to strings
    • gives warning message

dplyr and coercion

  • dplyr’s two table functions will
    • coerce two factors to a string if the factors have different levels
    • coerce two factors to a string if the factors have the same levels in a different order
    • return an error instead of coercing logicals, integers, and numerics to a character

Determining type

  • eighties has the year variable stores as a factor
  • sevnties has it stored as a character
  • when combined it will be a character
    • we will get a warning message
seventies
## # A tibble: 10 x 3
##      year                      album                 band
##    <fctr>                      <chr>                <chr>
##  1   1970 Bridge Over Troubled Water  Simon and Garfunkel
##  2   1971     Jesus Christ Superstar      Various Artists
##  3   1972                    Harvest           Neil Young
##  4   1973      The World is a Ghetto                  War
##  5   1974  Goodbye Yellow Brick Road           Elton John
##  6   1975 Elton John's Greatest Hits           Elton John
##  7   1976             Peter Frampton Frampton Comes Alive
##  8   1977                    Rumours        Fleetwood Mac
##  9   1978       Saturday Night Fever             Bee Gees
## 10   1979                 Billy Joel          52nd Street
eighties
## # A tibble: 10 x 3
##     year              album              band
##    <chr>              <chr>             <chr>
##  1  1980           The Wall        Pink Floyd
##  2  1981      Hi Infidelity    REO Speedwagon
##  3  1982               Asia              Asia
##  4  1983           Thriller   Michael Jackson
##  5  1984           Thriller   Michael Jackson
##  6  1985 Born in the U.S.A. Bruce Springsteen
##  7  1986    Whitney Houston   Whitney Houston
##  8  1987  Slippery When Wet          Bon Jovi
##  9  1988              Faith    George Michael
## 10  1989     Don't Be Cruel       Bobby Brown
both <- seventies %>% bind_rows(eighties)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
head(both)
## # A tibble: 6 x 3
##    year                      album                band
##   <chr>                      <chr>               <chr>
## 1  1970 Bridge Over Troubled Water Simon and Garfunkel
## 2  1971     Jesus Christ Superstar     Various Artists
## 3  1972                    Harvest          Neil Young
## 4  1973      The World is a Ghetto                 War
## 5  1974  Goodbye Yellow Brick Road          Elton John
## 6  1975 Elton John's Greatest Hits          Elton John

Results

# sixties$year is a numeric and seventies$year is a factor
str(sixties)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : num  1960 1961 1962 1963 1964 ...
##  $ album: chr  "The Sound of Music" "Camelot" "West Side Story" "West Side Story" ...
##  $ band : chr  "Original Broadway Cast" "Original Broadway Cast" "Soundtrack" "Soundtrack" ...
str(seventies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : Factor w/ 10 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10
##  $ album: chr  "Bridge Over Troubled Water" "Jesus Christ Superstar" "Harvest" "The World is a Ghetto" ...
##  $ band : chr  "Simon and Garfunkel" "Various Artists" "Neil Young" "War" ...
# This gives an error about not converting factor to numeric
seventies %>% bind_rows(sixties)
## Error in bind_rows_(x, .id): Column `year` can't be converted from factor to numeric
# So lets convert it....
# But this converts the year to the integer factor value, not the year. Whoops
seventies %>% 
  mutate(year = as.numeric(year))
## # A tibble: 10 x 3
##     year                      album                 band
##    <dbl>                      <chr>                <chr>
##  1     1 Bridge Over Troubled Water  Simon and Garfunkel
##  2     2     Jesus Christ Superstar      Various Artists
##  3     3                    Harvest           Neil Young
##  4     4      The World is a Ghetto                  War
##  5     5  Goodbye Yellow Brick Road           Elton John
##  6     6 Elton John's Greatest Hits           Elton John
##  7     7             Peter Frampton Frampton Comes Alive
##  8     8                    Rumours        Fleetwood Mac
##  9     9       Saturday Night Fever             Bee Gees
## 10    10                 Billy Joel          52nd Street
# We need to make it a character first to get the year, then convert to numeric
seventies %>% mutate(year = as.numeric(as.character(year)))
## # A tibble: 10 x 3
##     year                      album                 band
##    <dbl>                      <chr>                <chr>
##  1  1970 Bridge Over Troubled Water  Simon and Garfunkel
##  2  1971     Jesus Christ Superstar      Various Artists
##  3  1972                    Harvest           Neil Young
##  4  1973      The World is a Ghetto                  War
##  5  1974  Goodbye Yellow Brick Road           Elton John
##  6  1975 Elton John's Greatest Hits           Elton John
##  7  1976             Peter Frampton Frampton Comes Alive
##  8  1977                    Rumours        Fleetwood Mac
##  9  1978       Saturday Night Fever             Bee Gees
## 10  1979                 Billy Joel          52nd Street
# Now we can join the datasets correctly
seventies %>% 
  # Coerce seventies$year into a useful numeric
  mutate(year = as.numeric(as.character(year))) %>% 
  # Bind the updated version of seventies to sixties
  bind_rows(sixties) %>% 
  arrange(year)
## # A tibble: 20 x 3
##     year                          album                            band
##    <dbl>                          <chr>                           <chr>
##  1  1960             The Sound of Music          Original Broadway Cast
##  2  1961                        Camelot          Original Broadway Cast
##  3  1962                West Side Story                      Soundtrack
##  4  1963                West Side Story                      Soundtrack
##  5  1964                  Hello, Dolly!          Original Broadway Cast
##  6  1965                   Mary Poppins                      Soundtrack
##  7  1966 Whipped Cream & Other Delights Herb Alpert & The Tijuana Brass
##  8  1967            More of The Monkees                     The Monkees
##  9  1968           Are You Experienced?     The Jimi Hendrix Experience
## 10  1969             In-A-Gadda-Da-Vida                  Iron Butterfly
## 11  1970     Bridge Over Troubled Water             Simon and Garfunkel
## 12  1971         Jesus Christ Superstar                 Various Artists
## 13  1972                        Harvest                      Neil Young
## 14  1973          The World is a Ghetto                             War
## 15  1974      Goodbye Yellow Brick Road                      Elton John
## 16  1975     Elton John's Greatest Hits                      Elton John
## 17  1976                 Peter Frampton            Frampton Comes Alive
## 18  1977                        Rumours                   Fleetwood Mac
## 19  1978           Saturday Night Fever                        Bee Gees
## 20  1979                     Billy Joel                     52nd Street

   


Advanced Joining


What can go wrong?

  • This chapter is about becoming a master at joining data
  • The main problems with joins come from missing or duplicate key values or columns
  • NA in the key column?
    • Probably just need to remove these rows
  • Missing key column?
    • it might be in the row name. use rownames_to_column in tibble package
  • duplicate keys?
    • Might just need to include another column in the key criteria
    • If you really have duplicates be careful as you will get every combination of key value matches across datasets.

Spot the key

# Whoops, the primary key is in the row names for stage_songs
# We need to fix that before joining to stage_writers
stage_songs
##                                     musical year
## Children Will Listen         Into the Woods 1986
## Maria                       West Side Story 1957
## Memory                                 Cats 1981
## The Music of the Night Phantom of the Opera 1986
stage_writers
##                     song            composer
## 1   Children Will Listen    Stephen Sondheim
## 2                  Maria   Leonard Bernstein
## 3                 Memory Andrew Lloyd Webber
## 4 The Music of the Night Andrew Lloyd Webber
# Load the tibble package
library(tibble)

stage_songs %>% 
  # Add row names as a column named song
  rownames_to_column("song") %>% 
  # Left join stage_writers to stage_songs
  left_join(stage_writers, by = "song")
##                     song              musical year            composer
## 1   Children Will Listen       Into the Woods 1986    Stephen Sondheim
## 2                  Maria      West Side Story 1957   Leonard Bernstein
## 3                 Memory                 Cats 1981 Andrew Lloyd Webber
## 4 The Music of the Night Phantom of the Opera 1986 Andrew Lloyd Webber

Non-unique keys

  • Non unique keys can be okay if you want the info for each entry and don’t mind it being duplicated.
    • like years will get duplicated here because there are non unique keys in the musical column of composers dataset
shows
## # A tibble: 6 x 2
##                musical  year
##                  <chr> <int>
## 1   The Sound of Music  1959
## 2       Into the Woods  1986
## 3       The King and I  1951
## 4      West Side Story  1957
## 5                 Cats  1981
## 6 Phantom of the Opera  1986
# There are more than one composer for some musicals, so the musical column has duplicate key values. 
composers
## # A tibble: 8 x 2
##                musical             composer
##                  <chr>                <chr>
## 1                 Cats  Andrew Lloyd Webber
## 2       Into the Woods     Stephen Sondheim
## 3 Phantom of the Opera  Andrew Lloyd Webber
## 4       The King and I       Richard Rogers
## 5       The King and I Oscar Hammerstein II
## 6   The Sound of Music       Richard Rogers
## 7   The Sound of Music Oscar Hammerstein II
## 8      West Side Story    Leonard Bernstein
# join by musical
shows %>% left_join(composers, by = "musical")
## # A tibble: 8 x 3
##                musical  year             composer
##                  <chr> <int>                <chr>
## 1   The Sound of Music  1959       Richard Rogers
## 2   The Sound of Music  1959 Oscar Hammerstein II
## 3       Into the Woods  1986     Stephen Sondheim
## 4       The King and I  1951       Richard Rogers
## 5       The King and I  1951 Oscar Hammerstein II
## 6      West Side Story  1957    Leonard Bernstein
## 7                 Cats  1981  Andrew Lloyd Webber
## 8 Phantom of the Opera  1986  Andrew Lloyd Webber

Two non-unique keys

  • But in this case its kind of a problem becuse the key is duplicated in both show_songs and composers datasets
    • There are multiple songs in a musical and multiple composers for a musical
    • We end up with two rows for every song if there were two composers for that musical
show_songs
## # A tibble: 8 x 3
##                          song              musical  year
##                         <chr>                <chr> <int>
## 1 A Few of My Favorite Things   The Sound of Music  1959
## 2        Children Will Listen       Into the Woods  1986
## 3                   Edelweiss   The Sound of Music  1959
## 4         Getting to Know You       The King and I  1951
## 5                       Maria      West Side Story  1957
## 6                      Memory                 Cats  1981
## 7             Shall We Dance?       The King and I  1951
## 8      The Music of the Night Phantom of the Opera  1986
show_songs %>% 
  left_join(composers, by = "musical") %>%
  arrange(musical) %>% 
  head(12)
## # A tibble: 12 x 4
##                           song              musical  year
##                          <chr>                <chr> <int>
##  1                      Memory                 Cats  1981
##  2        Children Will Listen       Into the Woods  1986
##  3      The Music of the Night Phantom of the Opera  1986
##  4         Getting to Know You       The King and I  1951
##  5         Getting to Know You       The King and I  1951
##  6             Shall We Dance?       The King and I  1951
##  7             Shall We Dance?       The King and I  1951
##  8 A Few of My Favorite Things   The Sound of Music  1959
##  9 A Few of My Favorite Things   The Sound of Music  1959
## 10                   Edelweiss   The Sound of Music  1959
## 11                   Edelweiss   The Sound of Music  1959
## 12                       Maria      West Side Story  1957
## # ... with 1 more variables: composer <chr>

Missing keys

  • Missing keys can mistakenly join two rows that do not belong toegether
singers
## # A tibble: 2 x 2
##                movie                singer
##                <chr>                 <chr>
## 1               <NA> Arnold Schwarzenegger
## 2 The Sound of Music         Julie Andrews
two_songs
## # A tibble: 2 x 2
##                  song              movie
##                 <chr>              <chr>
## 1            Do-Re-Mi The Sound of Music
## 2 A Spoonful of Sugar               <NA>
# Examine the result of joining singers to two_songs
two_songs %>% inner_join(singers, by = "movie")
## # A tibble: 2 x 3
##                  song              movie                singer
##                 <chr>              <chr>                 <chr>
## 1            Do-Re-Mi The Sound of Music         Julie Andrews
## 2 A Spoonful of Sugar               <NA> Arnold Schwarzenegger
# Remove NA's from key before joining
two_songs %>% 
  filter(!is.na(movie)) %>% 
  inner_join(singers, by = "movie")
## # A tibble: 1 x 3
##       song              movie        singer
##      <chr>              <chr>         <chr>
## 1 Do-Re-Mi The Sound of Music Julie Andrews

Defining the keys

  • If you do not set a by argument dplyr will join on all columns that match across the two datasets
  • You can use named vetors to match column with different names by = c("name1"" = "name2")
  • If there are non-key columns with the same names they wil show up as name.x and name.y
    • You can change this by adding the suffix argument. suffix = c("name1","name2")

A subset of keys

  • The movie column is the key and name refers to different things in these two datasets
    • We will clean up the name columns after the join
movie_years
## # A tibble: 10 x 3
##                      movie          name  year
##                      <chr>         <chr> <int>
##  1     The Road to Morocco   Bing Crosby  1942
##  2            Going My Way   Bing Crosby  1944
##  3          Anchors Aweigh Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5         White Christmas   Bing Crosby  1954
##  6         The Tender Trap Frank Sinatra  1955
##  7            High Society   Bing Crosby  1956
##  8       The Joker is Wild Frank Sinatra  1957
##  9                Pal Joey Frank Sinatra  1957
## 10                 Can-Can Frank Sinatra  1960
movie_studios
## # A tibble: 10 x 2
##                      movie                  name
##                      <chr>                 <chr>
##  1     The Road to Morocco    Paramount Pictures
##  2            Going My Way    Paramount Pictures
##  3          Anchors Aweigh   Metro-Goldwyn-Mayer
##  4 Till the Clouds Roll By   Metro-Goldwyn-Mayer
##  5         White Christmas    Paramount Pictures
##  6         The Tender Trap   Metro-Goldwyn-Mayer
##  7            High Society   Metro-Goldwyn-Mayer
##  8       The Joker is Wild    Paramount Pictures
##  9                Pal Joey     Columbia Pictures
## 10                 Can-Can Twentieth-Century Fox
movie_years %>% 
  # Left join movie_studios to movie_years
  left_join(movie_studios, by = "movie") %>% 
  # Rename the columns: artist and studio
  rename(artist = name.x, studio = name.y)
## # A tibble: 10 x 4
##                      movie        artist  year                studio
##                      <chr>         <chr> <int>                 <chr>
##  1     The Road to Morocco   Bing Crosby  1942    Paramount Pictures
##  2            Going My Way   Bing Crosby  1944    Paramount Pictures
##  3          Anchors Aweigh Frank Sinatra  1945   Metro-Goldwyn-Mayer
##  4 Till the Clouds Roll By Frank Sinatra  1946   Metro-Goldwyn-Mayer
##  5         White Christmas   Bing Crosby  1954    Paramount Pictures
##  6         The Tender Trap Frank Sinatra  1955   Metro-Goldwyn-Mayer
##  7            High Society   Bing Crosby  1956   Metro-Goldwyn-Mayer
##  8       The Joker is Wild Frank Sinatra  1957    Paramount Pictures
##  9                Pal Joey Frank Sinatra  1957     Columbia Pictures
## 10                 Can-Can Frank Sinatra  1960 Twentieth-Century Fox

Mis-matched key names

  • Here we use the named argument to create the correct join since the keys have different column names
    • This is a very common case in most relational databases. You will have something_dimension_id in the fact table and just id as the key in the dimension table
# Identify the key column
elvis_songs
## # A tibble: 5 x 2
##                                   name          movie
##                                  <chr>          <chr>
## 1 (You're So Square) Baby I Don't Care Jailhouse Rock
## 2         I Can't Help Falling in Love    Blue Hawaii
## 3                       Jailhouse Rock Jailhouse Rock
## 4                       Viva Las Vegas Viva Las Vegas
## 5                    You Don't Know Me       Clambake
elvis_movies
## # A tibble: 4 x 2
##             name  year
##            <chr> <int>
## 1 Jailhouse Rock  1957
## 2    Blue Hawaii  1961
## 3 Viva Las Vegas  1963
## 4       Clambake  1967
elvis_movies %>% 
  # Left join elvis_songs to elvis_movies by this column
  left_join(elvis_songs, by = c("name" = "movie")) %>% 
  # Rename columns
  rename(movie = name, song = name.y)
## # A tibble: 5 x 3
##            movie  year                                 song
##            <chr> <int>                                <chr>
## 1 Jailhouse Rock  1957 (You're So Square) Baby I Don't Care
## 2 Jailhouse Rock  1957                       Jailhouse Rock
## 3    Blue Hawaii  1961         I Can't Help Falling in Love
## 4 Viva Las Vegas  1963                       Viva Las Vegas
## 5       Clambake  1967                    You Don't Know Me

More mismatched names

  • We can rename columns in the select statment
# Identify the key columns
movie_directors
## # A tibble: 10 x 3
##                       name        director                studio
##                      <chr>           <chr>                 <chr>
##  1          Anchors Aweigh   George Sidney   Metro-Goldwyn-Mayer
##  2                 Can-Can     Walter Lang Twentieth-Century Fox
##  3            Going My Way     Leo McCarey    Paramount Pictures
##  4            High Society Charles Walters   Metro-Goldwyn-Mayer
##  5                Pal Joey   George Sidney     Columbia Pictures
##  6       The Joker is Wild   Charles Vidor    Paramount Pictures
##  7     The Road to Morocco    David Butler    Paramount Pictures
##  8         The Tender Trap Charles Walters   Metro-Goldwyn-Mayer
##  9 Till the Clouds Roll By   Richard Whorf   Metro-Goldwyn-Mayer
## 10         White Christmas  Michael Curtiz    Paramount Pictures
movie_years
## # A tibble: 10 x 3
##                      movie          name  year
##                      <chr>         <chr> <int>
##  1     The Road to Morocco   Bing Crosby  1942
##  2            Going My Way   Bing Crosby  1944
##  3          Anchors Aweigh Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5         White Christmas   Bing Crosby  1954
##  6         The Tender Trap Frank Sinatra  1955
##  7            High Society   Bing Crosby  1956
##  8       The Joker is Wild Frank Sinatra  1957
##  9                Pal Joey Frank Sinatra  1957
## 10                 Can-Can Frank Sinatra  1960
movie_years %>% 
  # Left join movie_directors to movie_years
  left_join(movie_directors, by = c("movie" = "name")) %>% 
  # Arrange the columns using select()
  select(year, movie, artist = name, director, studio)
## # A tibble: 10 x 5
##     year                   movie        artist        director
##    <int>                   <chr>         <chr>           <chr>
##  1  1942     The Road to Morocco   Bing Crosby    David Butler
##  2  1944            Going My Way   Bing Crosby     Leo McCarey
##  3  1945          Anchors Aweigh Frank Sinatra   George Sidney
##  4  1946 Till the Clouds Roll By Frank Sinatra   Richard Whorf
##  5  1954         White Christmas   Bing Crosby  Michael Curtiz
##  6  1955         The Tender Trap Frank Sinatra Charles Walters
##  7  1956            High Society   Bing Crosby Charles Walters
##  8  1957       The Joker is Wild Frank Sinatra   Charles Vidor
##  9  1957                Pal Joey Frank Sinatra   George Sidney
## 10  1960                 Can-Can Frank Sinatra     Walter Lang
## # ... with 1 more variables: studio <chr>

Joining multiple tables

  • To join multiple tables we could do a bunch of left_joins in a row
  • The purrr package has a function reduce() that lets us apply a function recursively to a list of datasets
  • purrr is also written by hadley wickham so it will work well with dplyr functions
  • the purpose of purrr is to help us apply R functions to data in efficient ways

Join multiple tables

  • Using reduce to do a left_quick join of all tables
# Load the purrr library
library(purrr)

# Place supergroups, more_bands, and more_artists into a list
list(supergroups, more_bands, more_artists) %>% 
  # Use reduce to join together the contents of the list
  reduce(left_join, by = c("first" = "first","last" = "last"))
## # A tibble: 19 x 5
##                         supergroup   first          last
##                              <chr>   <chr>         <chr>
##  1                           Cream  Ginger         Baker
##  2                           Cream    Jack         Bruce
##  3                           Cream    Jack         Bruce
##  4                           Cream    Jack         Bruce
##  5                           Cream    Jack         Bruce
##  6                           Cream    Eric       Clapton
##  7 Crosby, Stills, Nash, and Young   David        Crosby
##  8 Crosby, Stills, Nash, and Young  Graham          Nash
##  9 Crosby, Stills, Nash, and Young Stephen        Stills
## 10 Crosby, Stills, Nash, and Young    Neil         Young
## 11                  The Highwaymen  Johnny          Cash
## 12                  The Highwaymen  Waylon      Jennings
## 13                  The Highwaymen    Kris Kristofferson
## 14                  The Highwaymen  Willie        Nelson
## 15          The Traveling Wilburys     Bob         Dylan
## 16          The Traveling Wilburys  George      Harrison
## 17          The Traveling Wilburys    Jeff         Lynne
## 18          The Traveling Wilburys     Roy       Orbison
## 19          The Traveling Wilburys     Tom         Petty
## # ... with 2 more variables: band <chr>, instrument <chr>

Filter multiple tables

  • Here we use reduce to do a filtering semi_join
list(more_artists, more_bands, supergroups) %>% 
  # Return rows of more_artists in all three datasets
  reduce(semi_join, by = c("first","last"))
## # A tibble: 11 x 3
##      first     last instrument
##      <chr>    <chr>      <chr>
##  1  Ginger    Baker      drums
##  2    Jack    Bruce       bass
##  3    Jack    Bruce       bass
##  4    Eric  Clapton     guitar
##  5   David   Crosby     guitar
##  6  Graham     Nash     guitar
##  7 Stephen   Stills     guitar
##  8    Neil    Young     guitar
##  9  George Harrison     guitar
## 10    Jeff    Lynne     vocals
## 11     Tom    Petty     vocals

Other implementations

  • We can do the same joins as we have been using here wih the merge function, but the syntax is more confusing
    • I will always just use dplyr joins
  • The dplyr joins translate to sql joins as show
    • These are good to know as often I am exploring a sql database before pulling data into R. It happens.
  • These are the DBMSs (Database management systems) you can easily connect to from R
    • Once you have connected you can write dplyr code and it will be translated to sql and run in the DBMS
    • This is incredible useful. Its worthy of its own corse.
    • checkout vignette("dbplyr") to see more about dplyr’s database capabilities
    • We can use collect to pull the data into R when we are ready or data.frame will also pull it in.

SQL

  • Each dplyr join has an SQL equivalent, and when you apply a dplyr join to a SQL data source (instead of a data frame), dplyr automatically converts your join to its SQL equivalent and runs it in the database.

Base R

# Alter the code to perform the join with a dplyr function
merge(bands, artists, by = c("first", "last"), all.x = TRUE) %>%
  arrange(band)
##        first      last               band instrument
## 1      Jimmy      Page       Led Zeppelin     Guitar
## 2       John    Bonham       Led Zeppelin       <NA>
## 3  John Paul     Jones       Led Zeppelin       <NA>
## 4     Robert     Plant       Led Zeppelin       <NA>
## 5     George  Harrison        The Beatles     Guitar
## 6       John    Lennon        The Beatles     Guitar
## 7       Paul McCartney        The Beatles       Bass
## 8      Ringo     Starr        The Beatles      Drums
## 9      Jimmy   Buffett  The Coral Reefers     Guitar
## 10   Charlie     Watts The Rolling Stones       <NA>
## 11     Keith  Richards The Rolling Stones     Guitar
## 12      Mick    Jagger The Rolling Stones     Vocals
## 13    Ronnie      Wood The Rolling Stones       <NA>
bands %>% left_join(artists, by = c("first", "last"))
## # A tibble: 13 x 4
##        first      last               band instrument
##        <chr>     <chr>              <chr>      <chr>
##  1      John    Bonham       Led Zeppelin       <NA>
##  2 John Paul     Jones       Led Zeppelin       <NA>
##  3     Jimmy      Page       Led Zeppelin     Guitar
##  4    Robert     Plant       Led Zeppelin       <NA>
##  5    George  Harrison        The Beatles     Guitar
##  6      John    Lennon        The Beatles     Guitar
##  7      Paul McCartney        The Beatles       Bass
##  8     Ringo     Starr        The Beatles      Drums
##  9     Jimmy   Buffett  The Coral Reefers     Guitar
## 10      Mick    Jagger The Rolling Stones     Vocals
## 11     Keith  Richards The Rolling Stones     Guitar
## 12   Charlie     Watts The Rolling Stones       <NA>
## 13    Ronnie      Wood The Rolling Stones       <NA>

   


Case Study


Lahman’s Baseball Database

  • library(Lahman)
  • Sean Lahman’s baseball database
  • 26 tables
  • A great example of real worl relational data

Universal keys?

  • In particular, do any variable names span all of the datasets?
library(Lahman)

# Examine lahmanNames
str(lahmanNames)
## List of 28
##  $ AllstarFull        :Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of  1 variable:
##   ..$ var: chr [1:8] "playerID" "yearID" "gameNum" "gameID" ...
##  $ Appearances        :Classes 'tbl_df', 'tbl' and 'data.frame': 21 obs. of  1 variable:
##   ..$ var: chr [1:21] "yearID" "teamID" "lgID" "playerID" ...
##  $ AwardsManagers     :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "awardID" "yearID" "lgID" ...
##  $ AwardsPlayers      :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "awardID" "yearID" "lgID" ...
##  $ AwardsShareManagers:Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of  1 variable:
##   ..$ var: chr [1:7] "awardID" "yearID" "lgID" "playerID" ...
##  $ AwardsSharePlayers :Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of  1 variable:
##   ..$ var: chr [1:7] "awardID" "yearID" "lgID" "playerID" ...
##  $ Batting            :Classes 'tbl_df', 'tbl' and 'data.frame': 22 obs. of  1 variable:
##   ..$ var: chr [1:22] "playerID" "yearID" "stint" "teamID" ...
##  $ battingLabels      :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ BattingPost        :Classes 'tbl_df', 'tbl' and 'data.frame': 22 obs. of  1 variable:
##   ..$ var: chr [1:22] "yearID" "round" "playerID" "teamID" ...
##  $ CollegePlaying     :Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of  1 variable:
##   ..$ var: chr [1:3] "playerID" "schoolID" "yearID"
##  $ Fielding           :Classes 'tbl_df', 'tbl' and 'data.frame': 18 obs. of  1 variable:
##   ..$ var: chr [1:18] "playerID" "yearID" "stint" "teamID" ...
##  $ fieldingLabels     :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ FieldingOF         :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "yearID" "stint" "Glf" ...
##  $ FieldingPost       :Classes 'tbl_df', 'tbl' and 'data.frame': 17 obs. of  1 variable:
##   ..$ var: chr [1:17] "playerID" "yearID" "teamID" "lgID" ...
##  $ HallOfFame         :Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of  1 variable:
##   ..$ var: chr [1:9] "playerID" "yearID" "votedBy" "ballots" ...
##  $ LahmanData         :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "file" "class" "nobs" "nvar" ...
##  $ Managers           :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "playerID" "yearID" "teamID" "lgID" ...
##  $ ManagersHalf       :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "playerID" "yearID" "teamID" "lgID" ...
##  $ Master             :Classes 'tbl_df', 'tbl' and 'data.frame': 26 obs. of  1 variable:
##   ..$ var: chr [1:26] "playerID" "birthYear" "birthMonth" "birthDay" ...
##  $ Pitching           :Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  1 variable:
##   ..$ var: chr [1:30] "playerID" "yearID" "stint" "teamID" ...
##  $ pitchingLabels     :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ PitchingPost       :Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  1 variable:
##   ..$ var: chr [1:30] "playerID" "yearID" "round" "teamID" ...
##  $ Salaries           :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "yearID" "teamID" "lgID" "playerID" ...
##  $ Schools            :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "schoolID" "name_full" "city" "state" ...
##  $ SeriesPost         :Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of  1 variable:
##   ..$ var: chr [1:9] "yearID" "round" "teamIDwinner" "lgIDwinner" ...
##  $ Teams              :Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of  1 variable:
##   ..$ var: chr [1:48] "yearID" "lgID" "teamID" "franchID" ...
##  $ TeamsFranchises    :Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of  1 variable:
##   ..$ var: chr [1:4] "franchID" "franchName" "active" "NAassoc"
##  $ TeamsHalf          :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "yearID" "lgID" "teamID" "Half" ...
# Find variables in common
reduce(lahmanNames, intersect)
## # A tibble: 0 x 1
## # ... with 1 variables: var <chr>

Common keys?

lahmanNames %>%  
  # Bind the data frames in lahmanNames
  bind_rows(.id = 'dataframe') %>%
  # Group the result by var
  group_by(var) %>%
  # Tally the number of appearances
  tally() %>%
  # Filter the data
  filter(n > 1) %>% 
  # Arrange the results
  arrange(desc(n))
## # A tibble: 59 x 2
##         var     n
##       <chr> <int>
##  1   yearID    21
##  2 playerID    19
##  3     lgID    17
##  4   teamID    13
##  5        G    10
##  6        L     6
##  7        W     6
##  8       BB     5
##  9       CS     5
## 10       GS     5
## # ... with 49 more rows

playerID

lahmanNames %>% 
  # Bind the data frames
  bind_rows(.id = 'dataframe') %>%
  # Filter the results
  filter(var == 'playerID') %>% 
  # Extract the dataframe variable
  `$`(dataframe)
##  [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
##  [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
##  [7] "Batting"             "BattingPost"         "CollegePlaying"     
## [10] "Fielding"            "FieldingOF"          "FieldingPost"       
## [13] "HallOfFame"          "Managers"            "ManagersHalf"       
## [16] "Master"              "Pitching"            "PitchingPost"       
## [19] "Salaries"

Salaries

Who are the players?

  • Get count of distinct players
players <- Master %>% 
  # Return one row for each distinct player
  distinct(playerID, nameFirst, nameLast)
  
nrow(players)
## [1] 18846
nrow(Master)
## [1] 18846

Missing salaries

  • How many players are completely missing salary information?
head(players)
##    playerID nameFirst nameLast
## 1 aardsda01     David  Aardsma
## 2 aaronha01      Hank    Aaron
## 3 aaronto01    Tommie    Aaron
## 4  aasedo01       Don     Aase
## 5  abadan01      Andy     Abad
## 6  abadfe01  Fernando     Abad
head(Salaries)
##   yearID teamID lgID  playerID salary
## 1   1985    ATL   NL barkele01 870000
## 2   1985    ATL   NL bedrost01 550000
## 3   1985    ATL   NL benedbr01 545000
## 4   1985    ATL   NL  campri01 633333
## 5   1985    ATL   NL ceronri01 625000
## 6   1985    ATL   NL chambch01 800000
players %>% 
  # Find all players who do not appear in Salaries
  anti_join(Salaries, by = 'playerID') %>%
  # Count them
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 13888

Unpaid games?

  • Is it possible that these players somehow did not play (and hence did not earn a salary)?
head(Appearances)
##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b
## 1   1871    BS1   NA barnero01    31 NA        31        31   0   0    0
## 2   1871    BS1   NA barrofr01    18 NA        18        18   0   0    0
## 3   1871    BS1   NA birdsda01    29 NA        29        29   0   7    0
## 4   1871    BS1   NA  conefr01    19 NA        19        19   0   0    0
## 5   1871    BS1   NA gouldch01    31 NA        31        31   0   0   30
## 6   1871    BS1   NA jackssa01    16 NA        16        16   0   0    0
##   G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1   16    0   15    0    0    0    0   NA   NA   NA
## 2    1    0    0   13    0    4   17   NA   NA   NA
## 3    0    0    0    0    0   27   27   NA   NA   NA
## 4    0    0    0   18    0    1   18   NA   NA   NA
## 5    0    0    0    0    0    1    1   NA   NA   NA
## 6   14    0    1    0    1    0    1   NA   NA   NA
players %>% 
  anti_join(Salaries, by = "playerID") %>% 
  # How many unsalaried players appear in Appearances?
  semi_join(Appearances, by = "playerID") %>% 
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 13695

How many games?

  • Can you determine how many games each of these unsalaried players played?
players %>% 
  # Find all players who do not appear in Salaries
  anti_join(Salaries, by = "playerID") %>% 
  # Join them to Appearances
  left_join(Appearances, by = "playerID") %>% 
  # Calculate total_games for each player
  group_by(playerID) %>%
  summarise(total_games = sum(G_all, na.rm = T)) %>%
  # Arrange in descending order by total_games
  arrange(desc(total_games))
## # A tibble: 13,888 x 2
##     playerID total_games
##        <chr>       <int>
##  1 yastrca01        3308
##  2 aaronha01        3298
##  3  cobbty01        3034
##  4 musiast01        3026
##  5  mayswi01        2992
##  6 robinbr01        2896
##  7 kalinal01        2834
##  8 collied01        2826
##  9 robinfr02        2808
## 10 wagneho01        2794
## # ... with 13,878 more rows

How many at-bats?

Is it possible that the unsalaried players did not actually play in the games that they appeared in?

head(Batting)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA
players %>%
  # Find unsalaried players
  anti_join(Salaries, by = "playerID") %>% 
  # Join Batting to the unsalaried players
  left_join(Batting, by = "playerID") %>% 
  # Group by player
  group_by(playerID) %>% 
  # Sum at-bats for each player
  summarize(total_at_bat = sum(AB, na.rm = T)) %>% 
  # Arrange in descending order
  arrange(desc(total_at_bat))
## # A tibble: 13,888 x 2
##     playerID total_at_bat
##        <chr>        <int>
##  1 aaronha01        12364
##  2 yastrca01        11988
##  3  cobbty01        11434
##  4 musiast01        10972
##  5  mayswi01        10881
##  6 robinbr01        10654
##  7 wagneho01        10430
##  8 brocklo01        10332
##  9 ansonca01        10277
## 10 aparilu01        10230
## # ... with 13,878 more rows

Introducing the hall of fame

Hall of fame nominations

  • Let’s see how many players have been nominated for the Hall of Fame.
head(HallOfFame)
##    playerID yearID votedBy ballots needed votes inducted category
## 1  cobbty01   1936   BBWAA     226    170   222        Y   Player
## 2  ruthba01   1936   BBWAA     226    170   215        Y   Player
## 3 wagneho01   1936   BBWAA     226    170   215        Y   Player
## 4 mathech01   1936   BBWAA     226    170   205        Y   Player
## 5 johnswa01   1936   BBWAA     226    170   189        Y   Player
## 6 lajoina01   1936   BBWAA     226    170   146        N   Player
##   needed_note
## 1        <NA>
## 2        <NA>
## 3        <NA>
## 4        <NA>
## 5        <NA>
## 6        <NA>
# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>% 
  distinct(playerID)

nominated %>% 
  # Count the number of players in nominated
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1239
nominated_full <- nominated %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

head(nominated_full)
##    playerID nameFirst  nameLast
## 1  cobbty01        Ty      Cobb
## 2  ruthba01      Babe      Ruth
## 3 wagneho01     Honus    Wagner
## 4 mathech01   Christy Mathewson
## 5 johnswa01    Walter   Johnson
## 6 lajoina01       Nap    Lajoie

Hall of fame inductions

  • Let’s now see how many players were admitted to the hall of fame to examine how selective the voting process is.
# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>% 
  filter(inducted == "Y") %>%
  distinct(playerID)

# Note if you wanted to keep all of the columns you can use keep_all in the distinct function
HallOfFame %>%
  distinct(playerID, .keep_all = TRUE) %>%
  filter(inducted == "Y") %>% 
  head()
##    playerID yearID    votedBy ballots needed votes inducted
## 1  cobbty01   1936      BBWAA     226    170   222        Y
## 2  ruthba01   1936      BBWAA     226    170   215        Y
## 3 wagneho01   1936      BBWAA     226    170   215        Y
## 4 mathech01   1936      BBWAA     226    170   205        Y
## 5 johnswa01   1936      BBWAA     226    170   189        Y
## 6 bulkemo99   1937 Centennial      NA     NA    NA        Y
##            category needed_note
## 1            Player        <NA>
## 2            Player        <NA>
## 3            Player        <NA>
## 4            Player        <NA>
## 5            Player        <NA>
## 6 Pioneer/Executive        <NA>
inducted %>% 
  # Count the number of players in inducted
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   312
inducted_full <- inducted %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

head(inducted_full)
##    playerID nameFirst  nameLast
## 1  cobbty01        Ty      Cobb
## 2  ruthba01      Babe      Ruth
## 3 wagneho01     Honus    Wagner
## 4 mathech01   Christy Mathewson
## 5 johnswa01    Walter   Johnson
## 6 lajoina01       Nap    Lajoie

Awards

  • Did nominees who were inducted earn more awards than nominees who were not inducted?
# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>% 
  group_by(playerID) %>% 
  tally()

head(nAwards)
## # A tibble: 6 x 2
##    playerID     n
##       <chr> <int>
## 1 aaronha01    16
## 2 abbotji01     2
## 3 abernte02     2
## 4 abreubo01     2
## 5 abreujo02     2
## 6 adamsba01     2
nAwards %>% 
  # Filter to just the players in inducted 
  semi_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = T))
## # A tibble: 1 x 1
##      avg_n
##      <dbl>
## 1 12.10582
nAwards %>% 
  # Filter to just the players in nominated 
  semi_join(nominated, by = "playerID") %>% 
  # Filter to players NOT in inducted 
  anti_join(inducted, by = "playerID") %>%
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = T))
## # A tibble: 1 x 1
##     avg_n
##     <dbl>
## 1 4.18985

Salary

  • Does the maximum salary earned by inductees tend to be greater than the maximum salary earned by nominees who were not inducted?
# Find the players who are in nominated, but not inducted
notInducted <- nominated %>% 
  setdiff(inducted, by = "playerID")

Salaries %>% 
  # Find the players who are in notInducted
  semi_join(notInducted) %>%
  # Calculate the max salary by player
  group_by(playerID) %>% 
  summarize(max_salary = max(salary)) %>% 
  # Calculate the average of the max salaries
  summarize(avg_salary = mean(max_salary))
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1    4876812
# Repeat for players who were inducted
Salaries %>% 
  semi_join(inducted) %>%
  # Calculate the max salary by player
  group_by(playerID) %>% 
  summarize(max_salary = max(salary)) %>% 
  # Calculate the average of the max salaries
  summarize(avg_salary = mean(max_salary))
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1    5673190

Retirement

  • One of the rules of the Hall of Fame is that players cannot be nominated until five years after they retire. Is this reflected in our data?
head(Appearances)
##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b
## 1   1871    BS1   NA barnero01    31 NA        31        31   0   0    0
## 2   1871    BS1   NA barrofr01    18 NA        18        18   0   0    0
## 3   1871    BS1   NA birdsda01    29 NA        29        29   0   7    0
## 4   1871    BS1   NA  conefr01    19 NA        19        19   0   0    0
## 5   1871    BS1   NA gouldch01    31 NA        31        31   0   0   30
## 6   1871    BS1   NA jackssa01    16 NA        16        16   0   0    0
##   G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1   16    0   15    0    0    0    0   NA   NA   NA
## 2    1    0    0   13    0    4   17   NA   NA   NA
## 3    0    0    0    0    0   27   27   NA   NA   NA
## 4    0    0    0   18    0    1   18   NA   NA   NA
## 5    0    0    0    0    0    1    1   NA   NA   NA
## 6   14    0    1    0    1    0    1   NA   NA   NA
Appearances %>% 
  # Filter Appearances against nominated
  semi_join(nominated, by = "playerID") %>%
  # Find last year played by player
  group_by(playerID) %>% 
  summarize(last_year = max(yearID)) %>%
  # Join to full HallOfFame
  left_join(HallOfFame, by = "playerID") %>% 
  # Filter for unusual observations
  filter(yearID <= last_year)
## # A tibble: 39 x 10
##     playerID last_year yearID         votedBy ballots needed votes
##        <chr>     <dbl>  <int>           <chr>   <int>  <int> <int>
##  1 cissebi01      1938   1937           BBWAA     201    151     1
##  2 cochrmi01      1937   1936           BBWAA     226    170    80
##  3  deandi01      1947   1945           BBWAA     247    186    17
##  4  deandi01      1947   1946    Final Ballot     263    198    45
##  5  deandi01      1947   1946 Nominating Vote     202     NA    40
##  6  deandi01      1947   1947           BBWAA     161    121    88
##  7 dickebi01      1946   1945           BBWAA     247    186    17
##  8 dickebi01      1946   1946 Nominating Vote     202     NA    40
##  9 dickebi01      1946   1946    Final Ballot     263    198    32
## 10 dimagjo01      1951   1945           BBWAA     247    186     1
## # ... with 29 more rows, and 3 more variables: inducted <fctr>,
## #   category <fctr>, needed_note <chr>