Whats Covered

  • Mutating joins
  • Filtering joins and set operations
  • Assembling data
  • Advanced joining
  • Case Study

Mutating Joins


  • Advantages to dplyr over base R merge function for joing data
    • dplyr always preserves the row order
    • dplyr has much more intuitive syntax
    • dplyr can be applied to databases, or spark
  • dplyr is a front end language for maniulating data that can be converted to multiple backends like SQL or spark.
    • This is pretty useful. Everything we cover in this class could work on a spark dataframe or a sql table

Keys

  • The Primary key needs to be unique in a table
  • The foreign key in the second table can be duplicated
  • second table will be matched to the primary table based on the primary key
  • The primary key may be one, two or even more columns in the table

Primary Keys

  • What is the primary key in artists dataset?
    • name
## # A tibble: 10 x 2
##    name            instrument
##    <chr>           <chr>     
##  1 Jimmy Buffett   Guitar    
##  2 George Harrison Guitar    
##  3 Mick Jagger     Vocals    
##  4 Tom Jones       Vocals    
##  5 Davy Jones      Vocals    
##  6 John Lennon     Guitar    
##  7 Paul McCartney  Bass      
##  8 Jimmy Page      Guitar    
##  9 Joe Perry       Guitar    
## 10 Elvis Presley   Vocals

Secondary Keys

  • Which datasets have a primary key that matches artists$name?
    • bands$name
    • songs$writer
## # A tibble: 6 x 3
##   name                      band                year
##   <chr>                     <chr>              <int>
## 1 A Hard Day's Night        The Beatles         1964
## 2 Magical Mystery Tour      The Beatles         1967
## 3 Beggar's Banquet          The Rolling Stones  1968
## 4 Abbey Road                The Beatles         1969
## 5 Led Zeppelin IV           Led Zeppelin        1971
## 6 The Dark Side of the Moon Pink Floyd          1973
## # A tibble: 6 x 2
##   name            band        
##   <chr>           <chr>       
## 1 John Bonham     Led Zeppelin
## 2 John Paul Jones Led Zeppelin
## 3 Jimmy Page      Led Zeppelin
## 4 Robert Plant    Led Zeppelin
## 5 George Harrison The Beatles 
## 6 John Lennon     The Beatles
## # A tibble: 4 x 3
##   song             album                writer        
##   <chr>            <chr>                <chr>         
## 1 Come Together    Abbey Road           John Lennon   
## 2 Dream On         Aerosmith            Steven Tyler  
## 3 Hello, Goodbye   Magical Mystery Tour Paul McCartney
## 4 It's Not Unusual Along Came Jones     Tom Jones

Multi-variable keys

  • What is the primary key in artists?
    • The combination of first and last
## # A tibble: 6 x 3
##   first  last     instrument
##   <chr>  <chr>    <chr>     
## 1 Jimmy  Buffett  Guitar    
## 2 George Harrison Guitar    
## 3 Mick   Jagger   Vocals    
## 4 Tom    Jones    Vocals    
## 5 Davy   Jones    Vocals    
## 6 John   Lennon   Guitar

Joins

  • Joins in dplyr will work on data frames, tibbles (tbl_df), and tbl references

A basic join

## # A tibble: 6 x 3
##   first  last     instrument
##   <chr>  <chr>    <chr>     
## 1 Jimmy  Buffett  Guitar    
## 2 George Harrison Guitar    
## 3 Mick   Jagger   Vocals    
## 4 Tom    Jones    Vocals    
## 5 Davy   Jones    Vocals    
## 6 John   Lennon   Guitar
## # A tibble: 6 x 3
##   first     last     band        
##   <chr>     <chr>    <chr>       
## 1 John      Bonham   Led Zeppelin
## 2 John Paul Jones    Led Zeppelin
## 3 Jimmy     Page     Led Zeppelin
## 4 Robert    Plant    Led Zeppelin
## 5 George    Harrison The Beatles 
## 6 John      Lennon   The Beatles
## # A tibble: 13 x 4
##    first     last      band               instrument
##    <chr>     <chr>     <chr>              <chr>     
##  1 John      Bonham    Led Zeppelin       <NA>      
##  2 John Paul Jones     Led Zeppelin       <NA>      
##  3 Jimmy     Page      Led Zeppelin       Guitar    
##  4 Robert    Plant     Led Zeppelin       <NA>      
##  5 George    Harrison  The Beatles        Guitar    
##  6 John      Lennon    The Beatles        Guitar    
##  7 Paul      McCartney The Beatles        Bass      
##  8 Ringo     Starr     The Beatles        Drums     
##  9 Jimmy     Buffett   The Coral Reefers  Guitar    
## 10 Mick      Jagger    The Rolling Stones Vocals    
## 11 Keith     Richards  The Rolling Stones Guitar    
## 12 Charlie   Watts     The Rolling Stones <NA>      
## 13 Ronnie    Wood      The Rolling Stones <NA>

A second join

  • Notice how the last column is pulled in as last.x and last.y if not included in the join - And how all combination of rows are created for each name that is duplicated, e.g. john
## # A tibble: 16 x 5
##    first     last.x    band               last.y    instrument
##    <chr>     <chr>     <chr>              <chr>     <chr>     
##  1 John      Bonham    Led Zeppelin       Lennon    Guitar    
##  2 John Paul Jones     Led Zeppelin       <NA>      <NA>      
##  3 Jimmy     Page      Led Zeppelin       Buffett   Guitar    
##  4 Jimmy     Page      Led Zeppelin       Page      Guitar    
##  5 Robert    Plant     Led Zeppelin       <NA>      <NA>      
##  6 George    Harrison  The Beatles        Harrison  Guitar    
##  7 John      Lennon    The Beatles        Lennon    Guitar    
##  8 Paul      McCartney The Beatles        McCartney Bass      
##  9 Paul      McCartney The Beatles        Simon     Guitar    
## 10 Ringo     Starr     The Beatles        Starr     Drums     
## 11 Jimmy     Buffett   The Coral Reefers  Buffett   Guitar    
## 12 Jimmy     Buffett   The Coral Reefers  Page      Guitar    
## 13 Mick      Jagger    The Rolling Stones Jagger    Vocals    
## 14 Keith     Richards  The Rolling Stones Richards  Guitar    
## 15 Charlie   Watts     The Rolling Stones <NA>      <NA>      
## 16 Ronnie    Wood      The Rolling Stones <NA>      <NA>
## # A tibble: 13 x 4
##    first     last      band               instrument
##    <chr>     <chr>     <chr>              <chr>     
##  1 John      Bonham    Led Zeppelin       <NA>      
##  2 John Paul Jones     Led Zeppelin       <NA>      
##  3 Jimmy     Page      Led Zeppelin       Guitar    
##  4 Robert    Plant     Led Zeppelin       <NA>      
##  5 George    Harrison  The Beatles        Guitar    
##  6 John      Lennon    The Beatles        Guitar    
##  7 Paul      McCartney The Beatles        Bass      
##  8 Ringo     Starr     The Beatles        Drums     
##  9 Jimmy     Buffett   The Coral Reefers  Guitar    
## 10 Mick      Jagger    The Rolling Stones Vocals    
## 11 Keith     Richards  The Rolling Stones Guitar    
## 12 Charlie   Watts     The Rolling Stones <NA>      
## 13 Ronnie    Wood      The Rolling Stones <NA>

A right join

## # A tibble: 6 x 4
##   first     last     band         instrument
##   <chr>     <chr>    <chr>        <chr>     
## 1 John      Bonham   Led Zeppelin <NA>      
## 2 John Paul Jones    Led Zeppelin <NA>      
## 3 Jimmy     Page     Led Zeppelin Guitar    
## 4 Robert    Plant    Led Zeppelin <NA>      
## 5 George    Harrison The Beatles  Guitar    
## 6 John      Lennon   The Beatles  Guitar
## # A tibble: 6 x 4
##   first     last     instrument band        
##   <chr>     <chr>    <chr>      <chr>       
## 1 John      Bonham   <NA>       Led Zeppelin
## 2 John Paul Jones    <NA>       Led Zeppelin
## 3 Jimmy     Page     Guitar     Led Zeppelin
## 4 Robert    Plant    <NA>       Led Zeppelin
## 5 George    Harrison Guitar     The Beatles 
## 6 John      Lennon   Guitar     The Beatles
## [1] TRUE

Variations on joins

  • left_join - prioritizes left dataset
  • right_join - prioritizes right dataset
  • inner_join - only retains rows in both datasets
  • full_join - retains all rows
  • Use %>% (pipes) to string together these joins

inner joins and full joins

## # A tibble: 4 x 4
##   song             album                first  last     
##   <chr>            <chr>                <chr>  <chr>    
## 1 Come Together    Abbey Road           John   Lennon   
## 2 Dream On         Aerosmith            Steven Tyler    
## 3 Hello, Goodbye   Magical Mystery Tour Paul   McCartney
## 4 It's Not Unusual Along Came Jones     Tom    Jones
## # A tibble: 6 x 3
##   album                     band                year
##   <chr>                     <chr>              <int>
## 1 A Hard Day's Night        The Beatles         1964
## 2 Magical Mystery Tour      The Beatles         1967
## 3 Beggar's Banquet          The Rolling Stones  1968
## 4 Abbey Road                The Beatles         1969
## 5 Led Zeppelin IV           Led Zeppelin        1971
## 6 The Dark Side of the Moon Pink Floyd          1973
## # A tibble: 3 x 6
##   song           album                first  last      band         year
##   <chr>          <chr>                <chr>  <chr>     <chr>       <int>
## 1 Come Together  Abbey Road           John   Lennon    The Beatles  1969
## 2 Dream On       Aerosmith            Steven Tyler     Aerosmith    1973
## 3 Hello, Goodbye Magical Mystery Tour Paul   McCartney The Beatles  1967
## # A tibble: 6 x 3
##   first  last     instrument
##   <chr>  <chr>    <chr>     
## 1 Jimmy  Buffett  Guitar    
## 2 George Harrison Guitar    
## 3 Mick   Jagger   Vocals    
## 4 Tom    Jones    Vocals    
## 5 Davy   Jones    Vocals    
## 6 John   Lennon   Guitar
## # A tibble: 6 x 3
##   first     last     band        
##   <chr>     <chr>    <chr>       
## 1 John      Bonham   Led Zeppelin
## 2 John Paul Jones    Led Zeppelin
## 3 Jimmy     Page     Led Zeppelin
## 4 Robert    Plant    Led Zeppelin
## 5 George    Harrison The Beatles 
## 6 John      Lennon   The Beatles
## # A tibble: 21 x 4
##    first  last      instrument band              
##    <chr>  <chr>     <chr>      <chr>             
##  1 Jimmy  Buffett   Guitar     The Coral Reefers 
##  2 George Harrison  Guitar     The Beatles       
##  3 Mick   Jagger    Vocals     The Rolling Stones
##  4 Tom    Jones     Vocals     <NA>              
##  5 Davy   Jones     Vocals     <NA>              
##  6 John   Lennon    Guitar     The Beatles       
##  7 Paul   McCartney Bass       The Beatles       
##  8 Jimmy  Page      Guitar     Led Zeppelin      
##  9 Joe    Perry     Guitar     <NA>              
## 10 Elvis  Presley   Vocals     <NA>              
## # ... with 11 more rows

Pipes

## # A tibble: 5 x 3
##   first  last     band              
##   <chr>  <chr>    <chr>             
## 1 Jimmy  Page     Led Zeppelin      
## 2 George Harrison The Beatles       
## 3 John   Lennon   The Beatles       
## 4 Jimmy  Buffett  The Coral Reefers 
## 5 Keith  Richards The Rolling Stones
## # A tibble: 5 x 3
##   first  last     band              
##   <chr>  <chr>    <chr>             
## 1 Jimmy  Page     Led Zeppelin      
## 2 George Harrison The Beatles       
## 3 John   Lennon   The Beatles       
## 4 Jimmy  Buffett  The Coral Reefers 
## 5 Keith  Richards The Rolling Stones

Practice with pipes and joins

## # A tibble: 3 x 6
##   first last      instrument band        song             album               
##   <chr> <chr>     <chr>      <chr>       <chr>            <chr>               
## 1 Tom   Jones     Vocals     <NA>        It's Not Unusual Along Came Jones    
## 2 John  Lennon    Guitar     The Beatles Come Together    Abbey Road          
## 3 Paul  McCartney Bass       The Beatles Hello, Goodbye   Magical Mystery Tour
## # A tibble: 6 x 3
##   first  last     instrument
##   <chr>  <chr>    <chr>     
## 1 Jimmy  Buffett  Guitar    
## 2 George Harrison Guitar    
## 3 Mick   Jagger   Vocals    
## 4 Tom    Jones    Vocals    
## 5 Davy   Jones    Vocals    
## 6 John   Lennon   Guitar
## # A tibble: 6 x 3
##   first     last     band        
##   <chr>     <chr>    <chr>       
## 1 John      Bonham   Led Zeppelin
## 2 John Paul Jones    Led Zeppelin
## 3 Jimmy     Page     Led Zeppelin
## 4 Robert    Plant    Led Zeppelin
## 5 George    Harrison The Beatles 
## 6 John      Lennon   The Beatles
## # A tibble: 4 x 4
##   song             album                first  last     
##   <chr>            <chr>                <chr>  <chr>    
## 1 Come Together    Abbey Road           John   Lennon   
## 2 Dream On         Aerosmith            Steven Tyler    
## 3 Hello, Goodbye   Magical Mystery Tour Paul   McCartney
## 4 It's Not Unusual Along Came Jones     Tom    Jones
## [1] TRUE

Choose your joins

## # A tibble: 29 x 7
##    first  last     instrument band            song         album            year
##    <chr>  <chr>    <chr>      <chr>           <chr>        <chr>           <int>
##  1 Jimmy  Buffett  Guitar     The Coral Reef~ <NA>         <NA>               NA
##  2 George Harrison Guitar     The Beatles     <NA>         <NA>               NA
##  3 Mick   Jagger   Vocals     The Rolling St~ <NA>         <NA>               NA
##  4 Tom    Jones    Vocals     <NA>            It's Not Un~ Along Came Jon~    NA
##  5 Davy   Jones    Vocals     <NA>            <NA>         <NA>               NA
##  6 John   Lennon   Guitar     The Beatles     Come Togeth~ Abbey Road       1969
##  7 Paul   McCartn~ Bass       The Beatles     Hello, Good~ Magical Myster~  1967
##  8 Jimmy  Page     Guitar     Led Zeppelin    <NA>         <NA>               NA
##  9 Joe    Perry    Guitar     <NA>            <NA>         <NA>               NA
## 10 Elvis  Presley  Vocals     <NA>            <NA>         <NA>               NA
## # ... with 19 more rows

   


Filtering joins and set operations


  • Filtering joins return a copy of the dataset that has been filtered, not augmented (as with mutating joins)

Semi-joins

Apply a semi-join

## # A tibble: 3 x 3
##   first last      instrument
##   <chr> <chr>     <chr>     
## 1 Tom   Jones     Vocals    
## 2 John  Lennon    Guitar    
## 3 Paul  McCartney Bass
## # A tibble: 3 x 3
##   first last      instrument
##   <chr> <chr>     <chr>     
## 1 John  Lennon    Guitar    
## 2 Paul  McCartney Bass      
## 3 Tom   Jones     Vocals

A more precise way to filter?

  • You can use semi_joins with a df of date you want to match to easily filter
## # A tibble: 21 x 6
##    track                       band              label      year first  last    
##    <chr>                       <chr>             <chr>     <int> <chr>  <chr>   
##  1 Can't Buy Me Love           The Beatles       Parlopho~  1964 Paul   McCartn~
##  2 I Feel Fine                 The Beatles       Parlopho~  1964 John   Lennon  
##  3 A Hard Day's Night          The Beatles       Parlopho~  1964 John   Lennon  
##  4 Sound of Silence            Simon and Garfun~ Columbia   1964 Paul   Simon   
##  5 Help!                       The Beatles       Parlopho~  1965 John   Lennon  
##  6 Ticket to Ride              The Beatles       Parlopho~  1965 John   Lennon  
##  7 I am a Rock                 Simon and Garfun~ Columbia   1965 Paul   Simon   
##  8 Yellow Submarine / Eleanor~ The Beatles       Parlopho~  1966 Paul   McCartn~
##  9 Homeward Bound              Simon and Garfun~ Columbia   1966 Paul   Simon   
## 10 Scarborough Fair            Simon and Garfun~ Columbia   1966 unkno~ unknown 
## # ... with 11 more rows
## # A tibble: 3 x 3
##   band                 year first
##   <chr>               <int> <chr>
## 1 The Beatles          1964 Paul 
## 2 The Beatles          1965 John 
## 3 Simon and Garfunkel  1966 Paul
## # A tibble: 4 x 6
##   track             band                label       year first last     
##   <chr>             <chr>               <chr>      <int> <chr> <chr>    
## 1 Can't Buy Me Love The Beatles         Parlophone  1964 Paul  McCartney
## 2 Help!             The Beatles         Parlophone  1965 John  Lennon   
## 3 Ticket to Ride    The Beatles         Parlophone  1965 John  Lennon   
## 4 Homeward Bound    Simon and Garfunkel Columbia    1966 Paul  Simon
## # A tibble: 4 x 6
##   track             band                label       year first last     
##   <chr>             <chr>               <chr>      <int> <chr> <chr>    
## 1 Can't Buy Me Love The Beatles         Parlophone  1964 Paul  McCartney
## 2 Help!             The Beatles         Parlophone  1965 John  Lennon   
## 3 Ticket to Ride    The Beatles         Parlophone  1965 John  Lennon   
## 4 Homeward Bound    Simon and Garfunkel Columbia    1966 Paul  Simon

Anti-joins

Apply an anti-join

## # A tibble: 8 x 3
##   first last    instrument
##   <chr> <chr>   <chr>     
## 1 Tom   Jones   Vocals    
## 2 Davy  Jones   Vocals    
## 3 Joe   Perry   Guitar    
## 4 Elvis Presley Vocals    
## 5 Paul  Simon   Guitar    
## 6 Joe   Walsh   Guitar    
## 7 Brian Wilson  Vocals    
## 8 Nancy Wilson  Vocals

Apply another anti-join

## Observations: 9
## Variables: 2
## $ album <chr> "Abbey Road", "A Hard Days Night", "Magical Mystery Tour", "L...
## $ label <chr> "Apple", "Parlophone", "Parlophone", "Atlantic", "Harvest", "...
## Observations: 9
## Variables: 3
## $ album <chr> "A Hard Day's Night", "Magical Mystery Tour", "Beggar's Banqu...
## $ band  <chr> "The Beatles", "The Beatles", "The Rolling Stones", "The Beat...
## $ year  <int> 1964, 1967, 1968, 1969, 1971, 1973, 1973, 1977, 1982
## # A tibble: 1 x 2
##   album             label     
##   <chr>             <chr>     
## 1 A Hard Days Night Parlophone

Which filtering join?

## # A tibble: 9 x 2
##   album                     label          
##   <chr>                     <chr>          
## 1 Abbey Road                Apple          
## 2 A Hard Days Night         Parlophone     
## 3 Magical Mystery Tour      Parlophone     
## 4 Led Zeppelin IV           Atlantic       
## 5 The Dark Side of the Moon Harvest        
## 6 Hotel California          Asylum         
## 7 Rumours                   Warner Brothers
## 8 Aerosmith                 Columbia       
## 9 Beggar's Banquet          Decca
## # A tibble: 4 x 4
##   song             album                first  last     
##   <chr>            <chr>                <chr>  <chr>    
## 1 Come Together    Abbey Road           John   Lennon   
## 2 Dream On         Aerosmith            Steven Tyler    
## 3 Hello, Goodbye   Magical Mystery Tour Paul   McCartney
## 4 It's Not Unusual Along Came Jones     Tom    Jones
## [1] 3

Set operations

  • union will return every row that appears in one or more of the datasets
    • If a row appears multiple times union will only return it once
  • interesect will return only the rows that appear in both datasets
  • setdiff will return the rows that appear in the first dataset but not the second

How many songs are there?

  • How many unique songs do these two albums contain in total?
## # A tibble: 6 x 2
##   song              length    
##   <chr>             <drtn>    
## 1 Make It           13260 secs
## 2 Somebody          13500 secs
## 3 Dream On          16080 secs
## 4 One Way Street    25200 secs
## 5 Mama Kin          15900 secs
## 6 Write me a Letter 15060 secs
## # A tibble: 6 x 2
##   song                    length    
##   <chr>                   <drtn>    
## 1 Dream On                16080 secs
## 2 Mama Kin                16020 secs
## 3 Same Old Song and Dance 11040 secs
## 4 Seasons of Winter       17820 secs
## 5 Sweet Emotion           11700 secs
## 6 Walk this Way           12780 secs
## [1] 24

Greatest hits

  • Which songs from Aerosmith made it onto Greates Hits?
## # A tibble: 1 x 2
##   song     length    
##   <chr>    <drtn>    
## 1 Dream On 16080 secs

Live! Bootleg songs

  • Which songs are on Live! Bootleg but not on Greatest Hits? Notice that the length of songs may be different when they are performed live.
## # A tibble: 8 x 1
##   song                                       
##   <chr>                                      
## 1 Lord of the Thighs                         
## 2 Toys in the Attic                          
## 3 Sick as a Dog                              
## 4 Sight for Sore Eyes                        
## 5 S.O.S. (Too Bad)                           
## 6 I Ain't Got You                            
## 7 Mother Popcorn/Draw the Line               
## 8 Train Kept A-Rollin'/Strangers in the Night

Multiple operations

  • Which songs appear on one of Live! Bootleg or Greatest Hits, but not both?
## # A tibble: 17 x 1
##    song                                       
##    <chr>                                      
##  1 Lord of the Thighs                         
##  2 Toys in the Attic                          
##  3 Sick as a Dog                              
##  4 Sight for Sore Eyes                        
##  5 S.O.S. (Too Bad)                           
##  6 I Ain't Got You                            
##  7 Mother Popcorn/Draw the Line               
##  8 Train Kept A-Rollin'/Strangers in the Night
##  9 Same Old Song and Dance                    
## 10 Seasons of Winter                          
## 11 Big Ten Inch Record                        
## 12 Draw the Line                              
## 13 Kings and Queens                           
## 14 Remember (Walking in the Sand)             
## 15 Lightning Strikes                          
## 16 Sweet Emotion (remix)                      
## 17 One Way Street (live)

Comparing datasets

  • Its common to want to know if one data set is the same as another dataset
  • dplyr’s setequal will do this easily
  • base R’s identical is will only return true if the datasets have the exact same rows in the exact same order

Recap:

  • Mutating Joins:
    • left_join
    • right_join
    • inner_join
    • full_join
  • Filtering Joins:
    • semi_join
    • anti_join
  • Set Operations:
    • union
    • intersect
    • setdiff
  • Comparisions
    • setequal

Apply setequal

  • Both anthologies claim to contain the complete studio recordings of Led Zeppelin, but do the anthologies contain the same exact songs?
## [1] FALSE
## [1] FALSE
## # A tibble: 15 x 2
##    song                        album                    
##    <chr>                       <chr>                    
##  1 Rock and Roll               The Song Remains the Same
##  2 Celebration Day             The Song Remains the Same
##  3 Black Dog                   The Song Remains the Same
##  4 Over the Hills and Far Away The Song Remains the Same
##  5 Misty Mountain Hop          The Song Remains the Same
##  6 Since I've Been Loving You  The Song Remains the Same
##  7 No Quarter                  The Song Remains the Same
##  8 The Song Remains the Same   The Song Remains the Same
##  9 The Rain Song               The Song Remains the Same
## 10 The Ocean                   The Song Remains the Same
## 11 Dazed and Confused          The Song Remains the Same
## 12 Stairway to Heaven          The Song Remains the Same
## 13 Moby Dick                   The Song Remains the Same
## 14 Heartbreaker                The Song Remains the Same
## 15 Whole Lotta Love            The Song Remains the Same
## # A tibble: 0 x 2
## # ... with 2 variables: song <chr>, album <chr>

Apply setequal again

  • If the datasets have the same columns and all are used in the key, then intersect is analagous to semi_join
  • In the same scenario setdiff is similar to anti_join
## # A tibble: 15 x 2
##    song                        album                    
##    <chr>                       <chr>                    
##  1 Rock and Roll               The Song Remains the Same
##  2 Celebration Day             The Song Remains the Same
##  3 Black Dog                   The Song Remains the Same
##  4 Over the Hills and Far Away The Song Remains the Same
##  5 Misty Mountain Hop          The Song Remains the Same
##  6 Since I've Been Loving You  The Song Remains the Same
##  7 No Quarter                  The Song Remains the Same
##  8 The Song Remains the Same   The Song Remains the Same
##  9 The Rain Song               The Song Remains the Same
## 10 The Ocean                   The Song Remains the Same
## 11 Dazed and Confused          The Song Remains the Same
## 12 Stairway to Heaven          The Song Remains the Same
## 13 Moby Dick                   The Song Remains the Same
## 14 Heartbreaker                The Song Remains the Same
## 15 Whole Lotta Love            The Song Remains the Same
## # A tibble: 0 x 2
## # ... with 2 variables: song <chr>, album <chr>

Comparing albums

  • Is the soundtrack of The Song Remains the Same the only difference between The Definitive Collection and The Complete Studio Recordings?
## [1] 85
## [1] 100
## [1] FALSE
## [1] TRUE

   


Assembling data


Binds

  • Base R binds
    • rbind, cbind
  • dplyr binds
    • bind_rows, bind_cols

 

  • Advantages of dplyr versions
    • faster
    • return a tibble (advanced data frame)
    • can handle lists of dataframes
    • .id argument in bind_rows allows you to pass in a name for each source dataframe
      • this will be added as a column to the result dataset
      • this is really handy when one of the variables for the data set is stored in the table name.
      • e.g. houly weather data for different days
    • rbind will return an error if the column names do not match exactly.
      • bind_rows will create a column for each unique column and distribute NAs appropriately

Which bind?

## # A tibble: 5 x 2
##   song                     length    
##   <chr>                    <drtn>    
## 1 Speak to Me               5400 secs
## 2 Breathe                   9780 secs
## 3 On the Run               12600 secs
## 4 Time                     24780 secs
## 5 The Great Gig in the Sky 15300 secs
## # A tibble: 5 x 2
##   song                length    
##   <chr>               <drtn>    
## 1 Money               23400 secs
## 2 Us and Them         28260 secs
## 3 Any Colour You Like 12240 secs
## 4 Brain Damage        13800 secs
## 5 Eclipse              7380 secs
## # A tibble: 10 x 2
##    song                     length    
##    <chr>                    <drtn>    
##  1 Speak to Me               5400 secs
##  2 Breathe                   9780 secs
##  3 On the Run               12600 secs
##  4 Time                     24780 secs
##  5 The Great Gig in the Sky 15300 secs
##  6 Money                    23400 secs
##  7 Us and Them              28260 secs
##  8 Any Colour You Like      12240 secs
##  9 Brain Damage             13800 secs
## 10 Eclipse                   7380 secs

Bind rows

  • you can pass bind_rows() a list of data frames like jimi to bind together into a single data frame
  • you can save the name of the list element into a column with the .id argument
## # A tibble: 3 x 2
##   album                year
##   <chr>               <int>
## 1 Are You Experienced  1967
## 2 Axis: Bold as Love   1967
## 3 Electric Ladyland    1968
## $`Are You Experienced`
## # A tibble: 10 x 2
##    song                     length    
##    <chr>                    <drtn>    
##  1 Purple Haze               9960 secs
##  2 Manic Depression         13560 secs
##  3 Hey Joe                  12180 secs
##  4 May This Be Love         11640 secs
##  5 I Don't Live Today       14100 secs
##  6 The Wind Cries Mary      12060 secs
##  7 Fire                      9240 secs
##  8 Third Stone from the Sun 24000 secs
##  9 Foxy Lady                11700 secs
## 10 Are You Experienced?     14100 secs
## 
## $`Axis: Bold As Love`
## # A tibble: 13 x 2
##    song                 length    
##    <chr>                <drtn>    
##  1 EXP                   6900 secs
##  2 Up from the Skies    10500 secs
##  3 Spanish Castle Magic 10800 secs
##  4 Wait Until Tomorrow  10800 secs
##  5 Ain't No Telling      6360 secs
##  6 Little Wing           8640 secs
##  7 If 6 was 9           19920 secs
##  8 You Got Me Floatin    9900 secs
##  9 Castles Made of Sand  9960 secs
## 10 She's So Fine         9420 secs
## 11 One Rainy Wish       13200 secs
## 12 Little Miss Lover     8400 secs
## 13 Bold as Love         15060 secs
## 
## $`Electric Ladyland`
## # A tibble: 16 x 2
##    song                                       length    
##    <chr>                                      <drtn>    
##  1 And the Gods Made Love                      4860 secs
##  2 Have You Ever Been (To Electric Ladyland)   7860 secs
##  3 Crosstown Traffic                           8700 secs
##  4 Voodoo Chile                               54000 secs
##  5 Little Miss Strange                        10320 secs
##  6 Long Hot Summer Night                      12420 secs
##  7 Come On (Part 1)                           14940 secs
##  8 Gypsy Eyes                                 13380 secs
##  9 Burning of the Midnight Lamp               13140 secs
## 10 Rainy Day, Dream Away                      13320 secs
## 11 1983... (A Merman I Should Turn to Be)     49140 secs
## 12 Moon, Turn the Tides... Gently Gently Away  3720 secs
## 13 Still Raining, Still Dreaming              15900 secs
## 14 House Burning Down                         16380 secs
## 15 All Along the Watchtower                   14460 secs
## 16 Voodoo Child (Slight Return)               18720 secs
## # A tibble: 39 x 4
##    album               song                     length      year
##    <chr>               <chr>                    <drtn>     <int>
##  1 Are You Experienced Purple Haze               9960 secs  1967
##  2 Are You Experienced Manic Depression         13560 secs  1967
##  3 Are You Experienced Hey Joe                  12180 secs  1967
##  4 Are You Experienced May This Be Love         11640 secs  1967
##  5 Are You Experienced I Don't Live Today       14100 secs  1967
##  6 Are You Experienced The Wind Cries Mary      12060 secs  1967
##  7 Are You Experienced Fire                      9240 secs  1967
##  8 Are You Experienced Third Stone from the Sun 24000 secs  1967
##  9 Are You Experienced Foxy Lady                11700 secs  1967
## 10 Are You Experienced Are You Experienced?     14100 secs  1967
## # ... with 29 more rows

Bind columns

## # A tibble: 67 x 2
##     year song                                   
##    <int> <chr>                                  
##  1  1947 Move It On Over                        
##  2  1947 My Love for You (Has Turned to Hate)   
##  3  1947 Never Again (Will I Knock on Your Door)
##  4  1947 On the Banks of the Old Ponchartrain   
##  5  1947 Pan American                           
##  6  1947 Wealth Won't Save Your Soul            
##  7  1948 A Mansion on the Hill                  
##  8  1948 Honky Tonkin'                          
##  9  1948 I Saw the Light                        
## 10  1948 I'm a Long Gone Daddy                  
## # ... with 57 more rows
## # A tibble: 67 x 2
##    song                             peak
##    <chr>                           <int>
##  1 (I Heard That) Lonesome Whistle     9
##  2 (I'm Gonna) Sing, Sing, Sing       NA
##  3 A Home in Heaven                   NA
##  4 A Mansion on the Hill              12
##  5 A Teardrop on a Rose               NA
##  6 At the First Fall of Snow          NA
##  7 Baby, We're Really in Love          4
##  8 California Zephyr                  NA
##  9 Calling You                        NA
## 10 Cold, Cold Heart                    1
## # ... with 57 more rows
## # A tibble: 67 x 3
##     year song                                     peak
##    <int> <chr>                                   <int>
##  1  1947 Move It On Over                             4
##  2  1947 My Love for You (Has Turned to Hate)       NA
##  3  1947 Never Again (Will I Knock on Your Door)    NA
##  4  1947 On the Banks of the Old Ponchartrain       NA
##  5  1947 Pan American                               NA
##  6  1947 Wealth Won't Save Your Soul                NA
##  7  1948 A Mansion on the Hill                      12
##  8  1948 Honky Tonkin'                              14
##  9  1948 I'm Satisfied with You                     NA
## 10  1948 I Just Don't Like This Kind of Living       5
## # ... with 57 more rows

Danger

  • The biggest risk when using bind_cols is that the observations may not align acros the rows of your data frame

Build a better data frame

  • R Base way to create a dataframe
    • data.frame() and as.data.frame()
  • dplyr way to create a dataframe
    • data_frame() and as_data_frame()
  • Advantages of using data_frame():
    • will never change the data type of a vector. (e.g. strings to factors)
    • will never add row names
    • will not changes unusual column names
    • only recycles length 1 inputs
    • evaluate lazily and in order. (So you can reference a column in the next column definition)
    • outputs a tibble (class tbl_df)
  • as_data_frame does all these same things
    • its just used on a list of columns

Make a data frame

## [1] 1947 1947 1947 1947 1947 1947
## [1] "Move It On Over"                        
## [2] "My Love for You (Has Turned to Hate)"   
## [3] "Never Again (Will I Knock on Your Door)"
## [4] "On the Banks of the Old Ponchartrain"   
## [5] "Pan American"                           
## [6] "Wealth Won't Save Your Soul"
## [1]  4 NA NA NA NA NA
## # A tibble: 11 x 3
##     year song                                    peak
##    <int> <chr>                                  <int>
##  1  1949 Lovesick Blues                             1
##  2  1950 Long Gone Lonesome Blues                   1
##  3  1950 Moanin' the Blues                          1
##  4  1950 Why Don't You Love Me                      1
##  5  1951 Cold, Cold Heart                           1
##  6  1951 Hey Good Lookin'                           1
##  7  1952 I'll Never Get Out of This World Alive     1
##  8  1952 Jambalaya (On the Bayou)                   1
##  9  1953 Kaw-Liga                                   1
## 10  1953 Take These Chains from My Heart            1
## 11  1953 Your Cheatin' Heart                        1

Lists of columns

## List of 3
##  $ year: int [1:67] 1947 1947 1947 1947 1947 1947 1948 1948 1948 1948 ...
##  $ song: chr [1:67] "Move It On Over" "My Love for You (Has Turned to Hate)" "Never Again (Will I Knock on Your Door)" "On the Banks of the Old Ponchartrain" ...
##  $ peak: int [1:67] 4 NA NA NA NA NA 12 14 NA 2 ...
## # A tibble: 11 x 3
##     year song                                    peak
##    <int> <chr>                                  <int>
##  1  1949 Lovesick Blues                             1
##  2  1950 Long Gone Lonesome Blues                   1
##  3  1950 Moanin' the Blues                          1
##  4  1950 Why Don't You Love Me                      1
##  5  1951 Cold, Cold Heart                           1
##  6  1951 Hey Good Lookin'                           1
##  7  1952 I'll Never Get Out of This World Alive     1
##  8  1952 Jambalaya (On the Bayou)                   1
##  9  1953 Kaw-Liga                                   1
## 10  1953 Take These Chains from My Heart            1
## 11  1953 Your Cheatin' Heart                        1

List of rows (data frames)

## List of 10
##  $ Got to Be There :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Ain'T No Sunshine" "I Wanna be Where You Are" "Girl Don't Take Your Love from Me" "In Our Small Way" ...
##   ..$ peak: int [1:10] NA NA NA NA 4 2 NA NA NA NA
##  $ Ben             :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Ben" "Greatest Show On Earth" "People Make the World Go 'Round" "We've Got a Good Thing" ...
##   ..$ peak: int [1:10] 1 NA NA NA NA NA NA NA NA NA
##  $ Music & Me      :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "With a Child's Heart" "Up Again" "All the Things You Are" "Happy" ...
##   ..$ peak: int [1:10] 50 NA NA NA NA NA NA NA NA NA
##  $ Forever, Michael:Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "We're Almost There" "Take Me Back" "One Day in Your Life" "Cinderella Stay Awhile" ...
##   ..$ peak: int [1:10] 54 NA NA NA NA 23 NA NA NA NA
##  $ Off the Wall    :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Don't Stop 'Til You Get Enough" "Rock with You" "Working Day and Night" "Get on the Floor" ...
##   ..$ peak: int [1:10] 1 1 NA NA 10 NA 10 NA NA NA
##  $ Thriller        :Classes 'tbl_df', 'tbl' and 'data.frame':    9 obs. of  2 variables:
##   ..$ song: chr [1:9] "Wanna Be Startin' Somethin" "Baby be Mine" "The Girl is Mine" "Thriller" ...
##   ..$ peak: int [1:9] 5 NA 2 4 1 1 7 10 NA
##  $ Bad             :Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##   ..$ song: chr [1:10] "Bad" "The Way You Make Me Feel" "Speed Demon" "Liberian Girl" ...
##   ..$ peak: int [1:10] 1 1 NA NA NA 11 1 1 1 7
##  $ Dangerous       :Classes 'tbl_df', 'tbl' and 'data.frame':    14 obs. of  2 variables:
##   ..$ song: chr [1:14] "Jam" "Why You Wanna Trip on Me" "In the Closet" "She Drives Me Wild" ...
##   ..$ peak: int [1:14] 26 NA 6 NA 3 NA 27 1 14 NA ...
##  $ HIStory         :Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  2 variables:
##   ..$ song: chr [1:30] "Billie Jean" "The Way You Make Me Feel" "Black or White" "Rock with You" ...
##   ..$ peak: int [1:30] NA NA NA NA NA NA NA NA NA NA ...
##  $ Invincible      :Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  2 variables:
##   ..$ song: chr [1:16] "Unbreakable" "Heartbreaker" "Invincible" "Break of Dawn" ...
##   ..$ peak: int [1:16] NA NA NA NA NA 10 NA NA NA NA ...
## # A tibble: 16 x 2
## # Groups:   album [10]
##    album            song                          
##    <chr>            <chr>                         
##  1 Got to Be There  Rockin' Robin                 
##  2 Ben              Ben                           
##  3 Music & Me       With a Child's Heart          
##  4 Forever, Michael Just a Little Bit of You      
##  5 Off the Wall     Don't Stop 'Til You Get Enough
##  6 Off the Wall     Rock with You                 
##  7 Thriller         Beat It                       
##  8 Thriller         Billie Jean                   
##  9 Bad              Bad                           
## 10 Bad              The Way You Make Me Feel      
## 11 Bad              Man in the Mirror             
## 12 Bad              I Just Can't Stop Loving You  
## 13 Bad              Dirty Diana                   
## 14 Dangerous        Black or White                
## 15 HIStory          You Are Not Alone             
## 16 Invincible       You Rock My World

Working with data types

  • Be aware of data types when combining vectors or data frames
    • R will need to make a decision about the resulting type if the original types differ for a vector/column

  • 6 atomic data types in R
    • logical
    • character
    • double
    • integer
    • complex
    • raw
## [1] "logical"
## [1] "character"
## [1] "double"
## [1] "integer"
## [1] "complex"
## [1] "raw"

Atomic data types

  • R will coerce logicals and numerics to strings if strings are present
  • R will coerce logicals to numerics if numerics are present
  • as.character() will coerce factors to their labels as strings if strings are present
  • as.numeric() will coerce factors to their levels as numerics if numerics are present

dplyr’s coercion rules

Base R coercion rules:

  • in general the more specific type of data will convert to the more general
    • integer and logical will go to integer
    • double and integer/locgical will go to double
    • character and integer/double/logical will go to character
  • factors are tricky
    • if joining to an integer or double the factor integer values will be used
    • if joining to a character the factor labels will be used
    • be careful if you have numeric labels for your factors

dplyr’s coercion rules:

  • dplyr functions will not automatically coerce data types
    • returns and error
    • expects you to manyally coerce data
  • Exception: factors
    • dplyr coverts non-aligning factors to strings
    • gives warning message

dplyr and coercion

  • dplyr’s two table functions will
    • coerce two factors to a string if the factors have different levels
    • coerce two factors to a string if the factors have the same levels in a different order
    • return an error instead of coercing logicals, integers, and numerics to a character

Determining type

  • eighties has the year variable stores as a factor
  • sevnties has it stored as a character
  • when combined it will be a character
    • we will get a warning message
## # A tibble: 10 x 3
##    year  album                      band                
##    <fct> <chr>                      <chr>               
##  1 1970  Bridge Over Troubled Water Simon and Garfunkel 
##  2 1971  Jesus Christ Superstar     Various Artists     
##  3 1972  Harvest                    Neil Young          
##  4 1973  The World is a Ghetto      War                 
##  5 1974  Goodbye Yellow Brick Road  Elton John          
##  6 1975  Elton John's Greatest Hits Elton John          
##  7 1976  Peter Frampton             Frampton Comes Alive
##  8 1977  Rumours                    Fleetwood Mac       
##  9 1978  Saturday Night Fever       Bee Gees            
## 10 1979  Billy Joel                 52nd Street
## # A tibble: 10 x 3
##    year  album              band             
##    <chr> <chr>              <chr>            
##  1 1980  The Wall           Pink Floyd       
##  2 1981  Hi Infidelity      REO Speedwagon   
##  3 1982  Asia               Asia             
##  4 1983  Thriller           Michael Jackson  
##  5 1984  Thriller           Michael Jackson  
##  6 1985  Born in the U.S.A. Bruce Springsteen
##  7 1986  Whitney Houston    Whitney Houston  
##  8 1987  Slippery When Wet  Bon Jovi         
##  9 1988  Faith              George Michael   
## 10 1989  Don't Be Cruel     Bobby Brown
## Warning in bind_rows_(x, .id): binding factor and character vector, coercing
## into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
## # A tibble: 6 x 3
##   year  album                      band               
##   <chr> <chr>                      <chr>              
## 1 1970  Bridge Over Troubled Water Simon and Garfunkel
## 2 1971  Jesus Christ Superstar     Various Artists    
## 3 1972  Harvest                    Neil Young         
## 4 1973  The World is a Ghetto      War                
## 5 1974  Goodbye Yellow Brick Road  Elton John         
## 6 1975  Elton John's Greatest Hits Elton John

Results

## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : num  1960 1961 1962 1963 1964 ...
##  $ album: chr  "The Sound of Music" "Camelot" "West Side Story" "West Side Story" ...
##  $ band : chr  "Original Broadway Cast" "Original Broadway Cast" "Soundtrack" "Soundtrack" ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  3 variables:
##  $ year : Factor w/ 10 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10
##  $ album: chr  "Bridge Over Troubled Water" "Jesus Christ Superstar" "Harvest" "The World is a Ghetto" ...
##  $ band : chr  "Simon and Garfunkel" "Various Artists" "Neil Young" "War" ...
## Error: Column `year` can't be converted from factor to numeric
## # A tibble: 10 x 3
##     year album                      band                
##    <dbl> <chr>                      <chr>               
##  1     1 Bridge Over Troubled Water Simon and Garfunkel 
##  2     2 Jesus Christ Superstar     Various Artists     
##  3     3 Harvest                    Neil Young          
##  4     4 The World is a Ghetto      War                 
##  5     5 Goodbye Yellow Brick Road  Elton John          
##  6     6 Elton John's Greatest Hits Elton John          
##  7     7 Peter Frampton             Frampton Comes Alive
##  8     8 Rumours                    Fleetwood Mac       
##  9     9 Saturday Night Fever       Bee Gees            
## 10    10 Billy Joel                 52nd Street
## # A tibble: 10 x 3
##     year album                      band                
##    <dbl> <chr>                      <chr>               
##  1  1970 Bridge Over Troubled Water Simon and Garfunkel 
##  2  1971 Jesus Christ Superstar     Various Artists     
##  3  1972 Harvest                    Neil Young          
##  4  1973 The World is a Ghetto      War                 
##  5  1974 Goodbye Yellow Brick Road  Elton John          
##  6  1975 Elton John's Greatest Hits Elton John          
##  7  1976 Peter Frampton             Frampton Comes Alive
##  8  1977 Rumours                    Fleetwood Mac       
##  9  1978 Saturday Night Fever       Bee Gees            
## 10  1979 Billy Joel                 52nd Street
## # A tibble: 20 x 3
##     year album                          band                           
##    <dbl> <chr>                          <chr>                          
##  1  1960 The Sound of Music             Original Broadway Cast         
##  2  1961 Camelot                        Original Broadway Cast         
##  3  1962 West Side Story                Soundtrack                     
##  4  1963 West Side Story                Soundtrack                     
##  5  1964 Hello, Dolly!                  Original Broadway Cast         
##  6  1965 Mary Poppins                   Soundtrack                     
##  7  1966 Whipped Cream & Other Delights Herb Alpert & The Tijuana Brass
##  8  1967 More of The Monkees            The Monkees                    
##  9  1968 Are You Experienced?           The Jimi Hendrix Experience    
## 10  1969 In-A-Gadda-Da-Vida             Iron Butterfly                 
## 11  1970 Bridge Over Troubled Water     Simon and Garfunkel            
## 12  1971 Jesus Christ Superstar         Various Artists                
## 13  1972 Harvest                        Neil Young                     
## 14  1973 The World is a Ghetto          War                            
## 15  1974 Goodbye Yellow Brick Road      Elton John                     
## 16  1975 Elton John's Greatest Hits     Elton John                     
## 17  1976 Peter Frampton                 Frampton Comes Alive           
## 18  1977 Rumours                        Fleetwood Mac                  
## 19  1978 Saturday Night Fever           Bee Gees                       
## 20  1979 Billy Joel                     52nd Street

   


Advanced Joining


What can go wrong?

  • This chapter is about becoming a master at joining data
  • The main problems with joins come from missing or duplicate key values or columns
  • NA in the key column?
    • Probably just need to remove these rows
  • Missing key column?
    • it might be in the row name. use rownames_to_column in tibble package
  • duplicate keys?
    • Might just need to include another column in the key criteria
    • If you really have duplicates be careful as you will get every combination of key value matches across datasets.

Spot the key

##                                     musical year
## Children Will Listen         Into the Woods 1986
## Maria                       West Side Story 1957
## Memory                                 Cats 1981
## The Music of the Night Phantom of the Opera 1986
##                     song            composer
## 1   Children Will Listen    Stephen Sondheim
## 2                  Maria   Leonard Bernstein
## 3                 Memory Andrew Lloyd Webber
## 4 The Music of the Night Andrew Lloyd Webber
##                     song              musical year            composer
## 1   Children Will Listen       Into the Woods 1986    Stephen Sondheim
## 2                  Maria      West Side Story 1957   Leonard Bernstein
## 3                 Memory                 Cats 1981 Andrew Lloyd Webber
## 4 The Music of the Night Phantom of the Opera 1986 Andrew Lloyd Webber

Non-unique keys

  • Non unique keys can be okay if you want the info for each entry and don’t mind it being duplicated.
    • like years will get duplicated here because there are non unique keys in the musical column of composers dataset
## # A tibble: 6 x 2
##   musical               year
##   <chr>                <int>
## 1 The Sound of Music    1959
## 2 Into the Woods        1986
## 3 The King and I        1951
## 4 West Side Story       1957
## 5 Cats                  1981
## 6 Phantom of the Opera  1986
## # A tibble: 8 x 2
##   musical              composer            
##   <chr>                <chr>               
## 1 Cats                 Andrew Lloyd Webber 
## 2 Into the Woods       Stephen Sondheim    
## 3 Phantom of the Opera Andrew Lloyd Webber 
## 4 The King and I       Richard Rogers      
## 5 The King and I       Oscar Hammerstein II
## 6 The Sound of Music   Richard Rogers      
## 7 The Sound of Music   Oscar Hammerstein II
## 8 West Side Story      Leonard Bernstein
## # A tibble: 8 x 3
##   musical               year composer            
##   <chr>                <int> <chr>               
## 1 The Sound of Music    1959 Richard Rogers      
## 2 The Sound of Music    1959 Oscar Hammerstein II
## 3 Into the Woods        1986 Stephen Sondheim    
## 4 The King and I        1951 Richard Rogers      
## 5 The King and I        1951 Oscar Hammerstein II
## 6 West Side Story       1957 Leonard Bernstein   
## 7 Cats                  1981 Andrew Lloyd Webber 
## 8 Phantom of the Opera  1986 Andrew Lloyd Webber

Two non-unique keys

  • But in this case its kind of a problem becuse the key is duplicated in both show_songs and composers datasets
    • There are multiple songs in a musical and multiple composers for a musical
    • We end up with two rows for every song if there were two composers for that musical
## # A tibble: 8 x 3
##   song                        musical               year
##   <chr>                       <chr>                <int>
## 1 A Few of My Favorite Things The Sound of Music    1959
## 2 Children Will Listen        Into the Woods        1986
## 3 Edelweiss                   The Sound of Music    1959
## 4 Getting to Know You         The King and I        1951
## 5 Maria                       West Side Story       1957
## 6 Memory                      Cats                  1981
## 7 Shall We Dance?             The King and I        1951
## 8 The Music of the Night      Phantom of the Opera  1986
## # A tibble: 12 x 4
##    song                        musical               year composer            
##    <chr>                       <chr>                <int> <chr>               
##  1 Memory                      Cats                  1981 Andrew Lloyd Webber 
##  2 Children Will Listen        Into the Woods        1986 Stephen Sondheim    
##  3 The Music of the Night      Phantom of the Opera  1986 Andrew Lloyd Webber 
##  4 Getting to Know You         The King and I        1951 Richard Rogers      
##  5 Getting to Know You         The King and I        1951 Oscar Hammerstein II
##  6 Shall We Dance?             The King and I        1951 Richard Rogers      
##  7 Shall We Dance?             The King and I        1951 Oscar Hammerstein II
##  8 A Few of My Favorite Things The Sound of Music    1959 Richard Rogers      
##  9 A Few of My Favorite Things The Sound of Music    1959 Oscar Hammerstein II
## 10 Edelweiss                   The Sound of Music    1959 Richard Rogers      
## 11 Edelweiss                   The Sound of Music    1959 Oscar Hammerstein II
## 12 Maria                       West Side Story       1957 Leonard Bernstein

Missing keys

  • Missing keys can mistakenly join two rows that do not belong toegether
## # A tibble: 2 x 2
##   movie              singer               
##   <chr>              <chr>                
## 1 <NA>               Arnold Schwarzenegger
## 2 The Sound of Music Julie Andrews
## # A tibble: 2 x 2
##   song                movie             
##   <chr>               <chr>             
## 1 Do-Re-Mi            The Sound of Music
## 2 A Spoonful of Sugar <NA>
## # A tibble: 2 x 3
##   song                movie              singer               
##   <chr>               <chr>              <chr>                
## 1 Do-Re-Mi            The Sound of Music Julie Andrews        
## 2 A Spoonful of Sugar <NA>               Arnold Schwarzenegger
## # A tibble: 1 x 3
##   song     movie              singer       
##   <chr>    <chr>              <chr>        
## 1 Do-Re-Mi The Sound of Music Julie Andrews

Defining the keys

  • If you do not set a by argument dplyr will join on all columns that match across the two datasets
  • You can use named vetors to match column with different names by = c("name1"" = "name2")
  • If there are non-key columns with the same names they wil show up as name.x and name.y
    • You can change this by adding the suffix argument. suffix = c("name1","name2")

A subset of keys

  • The movie column is the key and name refers to different things in these two datasets
    • We will clean up the name columns after the join
## # A tibble: 10 x 3
##    movie                   name           year
##    <chr>                   <chr>         <int>
##  1 The Road to Morocco     Bing Crosby    1942
##  2 Going My Way            Bing Crosby    1944
##  3 Anchors Aweigh          Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5 White Christmas         Bing Crosby    1954
##  6 The Tender Trap         Frank Sinatra  1955
##  7 High Society            Bing Crosby    1956
##  8 The Joker is Wild       Frank Sinatra  1957
##  9 Pal Joey                Frank Sinatra  1957
## 10 Can-Can                 Frank Sinatra  1960
## # A tibble: 10 x 2
##    movie                   name                 
##    <chr>                   <chr>                
##  1 The Road to Morocco     Paramount Pictures   
##  2 Going My Way            Paramount Pictures   
##  3 Anchors Aweigh          Metro-Goldwyn-Mayer  
##  4 Till the Clouds Roll By Metro-Goldwyn-Mayer  
##  5 White Christmas         Paramount Pictures   
##  6 The Tender Trap         Metro-Goldwyn-Mayer  
##  7 High Society            Metro-Goldwyn-Mayer  
##  8 The Joker is Wild       Paramount Pictures   
##  9 Pal Joey                Columbia Pictures    
## 10 Can-Can                 Twentieth-Century Fox
## # A tibble: 10 x 4
##    movie                   artist         year studio               
##    <chr>                   <chr>         <int> <chr>                
##  1 The Road to Morocco     Bing Crosby    1942 Paramount Pictures   
##  2 Going My Way            Bing Crosby    1944 Paramount Pictures   
##  3 Anchors Aweigh          Frank Sinatra  1945 Metro-Goldwyn-Mayer  
##  4 Till the Clouds Roll By Frank Sinatra  1946 Metro-Goldwyn-Mayer  
##  5 White Christmas         Bing Crosby    1954 Paramount Pictures   
##  6 The Tender Trap         Frank Sinatra  1955 Metro-Goldwyn-Mayer  
##  7 High Society            Bing Crosby    1956 Metro-Goldwyn-Mayer  
##  8 The Joker is Wild       Frank Sinatra  1957 Paramount Pictures   
##  9 Pal Joey                Frank Sinatra  1957 Columbia Pictures    
## 10 Can-Can                 Frank Sinatra  1960 Twentieth-Century Fox

Mis-matched key names

  • Here we use the named argument to create the correct join since the keys have different column names
    • This is a very common case in most relational databases. You will have something_dimension_id in the fact table and just id as the key in the dimension table
## # A tibble: 5 x 2
##   name                                 movie         
##   <chr>                                <chr>         
## 1 (You're So Square) Baby I Don't Care Jailhouse Rock
## 2 I Can't Help Falling in Love         Blue Hawaii   
## 3 Jailhouse Rock                       Jailhouse Rock
## 4 Viva Las Vegas                       Viva Las Vegas
## 5 You Don't Know Me                    Clambake
## # A tibble: 4 x 2
##   name            year
##   <chr>          <int>
## 1 Jailhouse Rock  1957
## 2 Blue Hawaii     1961
## 3 Viva Las Vegas  1963
## 4 Clambake        1967
## # A tibble: 5 x 3
##   movie           year song                                
##   <chr>          <int> <chr>                               
## 1 Jailhouse Rock  1957 (You're So Square) Baby I Don't Care
## 2 Jailhouse Rock  1957 Jailhouse Rock                      
## 3 Blue Hawaii     1961 I Can't Help Falling in Love        
## 4 Viva Las Vegas  1963 Viva Las Vegas                      
## 5 Clambake        1967 You Don't Know Me

More mismatched names

  • We can rename columns in the select statment
## # A tibble: 10 x 3
##    name                    director        studio               
##    <chr>                   <chr>           <chr>                
##  1 Anchors Aweigh          George Sidney   Metro-Goldwyn-Mayer  
##  2 Can-Can                 Walter Lang     Twentieth-Century Fox
##  3 Going My Way            Leo McCarey     Paramount Pictures   
##  4 High Society            Charles Walters Metro-Goldwyn-Mayer  
##  5 Pal Joey                George Sidney   Columbia Pictures    
##  6 The Joker is Wild       Charles Vidor   Paramount Pictures   
##  7 The Road to Morocco     David Butler    Paramount Pictures   
##  8 The Tender Trap         Charles Walters Metro-Goldwyn-Mayer  
##  9 Till the Clouds Roll By Richard Whorf   Metro-Goldwyn-Mayer  
## 10 White Christmas         Michael Curtiz  Paramount Pictures
## # A tibble: 10 x 3
##    movie                   name           year
##    <chr>                   <chr>         <int>
##  1 The Road to Morocco     Bing Crosby    1942
##  2 Going My Way            Bing Crosby    1944
##  3 Anchors Aweigh          Frank Sinatra  1945
##  4 Till the Clouds Roll By Frank Sinatra  1946
##  5 White Christmas         Bing Crosby    1954
##  6 The Tender Trap         Frank Sinatra  1955
##  7 High Society            Bing Crosby    1956
##  8 The Joker is Wild       Frank Sinatra  1957
##  9 Pal Joey                Frank Sinatra  1957
## 10 Can-Can                 Frank Sinatra  1960
## # A tibble: 10 x 5
##     year movie                  artist        director       studio             
##    <int> <chr>                  <chr>         <chr>          <chr>              
##  1  1942 The Road to Morocco    Bing Crosby   David Butler   Paramount Pictures 
##  2  1944 Going My Way           Bing Crosby   Leo McCarey    Paramount Pictures 
##  3  1945 Anchors Aweigh         Frank Sinatra George Sidney  Metro-Goldwyn-Mayer
##  4  1946 Till the Clouds Roll ~ Frank Sinatra Richard Whorf  Metro-Goldwyn-Mayer
##  5  1954 White Christmas        Bing Crosby   Michael Curtiz Paramount Pictures 
##  6  1955 The Tender Trap        Frank Sinatra Charles Walte~ Metro-Goldwyn-Mayer
##  7  1956 High Society           Bing Crosby   Charles Walte~ Metro-Goldwyn-Mayer
##  8  1957 The Joker is Wild      Frank Sinatra Charles Vidor  Paramount Pictures 
##  9  1957 Pal Joey               Frank Sinatra George Sidney  Columbia Pictures  
## 10  1960 Can-Can                Frank Sinatra Walter Lang    Twentieth-Century ~

Joining multiple tables

  • To join multiple tables we could do a bunch of left_joins in a row
  • The purrr package has a function reduce() that lets us apply a function recursively to a list of datasets
  • purrr is also written by hadley wickham so it will work well with dplyr functions
  • the purpose of purrr is to help us apply R functions to data in efficient ways

Join multiple tables

  • Using reduce to do a left_quick join of all tables
## # A tibble: 19 x 5
##    supergroup               first   last       band                   instrument
##    <chr>                    <chr>   <chr>      <chr>                  <chr>     
##  1 Cream                    Ginger  Baker      Graham Bond Organisat~ drums     
##  2 Cream                    Jack    Bruce      Graham Bond Organisat~ bass      
##  3 Cream                    Jack    Bruce      Graham Bond Organisat~ bass      
##  4 Cream                    Jack    Bruce      Manfred Mann           bass      
##  5 Cream                    Jack    Bruce      Manfred Mann           bass      
##  6 Cream                    Eric    Clapton    The Yardbirds          guitar    
##  7 Crosby, Stills, Nash, a~ David   Crosby     The Byrds              guitar    
##  8 Crosby, Stills, Nash, a~ Graham  Nash       The Hollies            guitar    
##  9 Crosby, Stills, Nash, a~ Stephen Stills     Buffalo Springfield    guitar    
## 10 Crosby, Stills, Nash, a~ Neil    Young      Buffalo Springfield    guitar    
## 11 The Highwaymen           Johnny  Cash       <NA>                   guitar    
## 12 The Highwaymen           Waylon  Jennings   <NA>                   guitar    
## 13 The Highwaymen           Kris    Kristoffe~ <NA>                   guitar    
## 14 The Highwaymen           Willie  Nelson     <NA>                   guitar    
## 15 The Traveling Wilburys   Bob     Dylan      <NA>                   guitar    
## 16 The Traveling Wilburys   George  Harrison   The Beatles            guitar    
## 17 The Traveling Wilburys   Jeff    Lynne      Electric Light Orches~ vocals    
## 18 The Traveling Wilburys   Roy     Orbison    <NA>                   guitar    
## 19 The Traveling Wilburys   Tom     Petty      Tom Petty and the Hea~ vocals

Filter multiple tables

  • Here we use reduce to do a filtering semi_join
## # A tibble: 11 x 3
##    first   last     instrument
##    <chr>   <chr>    <chr>     
##  1 David   Crosby   guitar    
##  2 Eric    Clapton  guitar    
##  3 George  Harrison guitar    
##  4 Ginger  Baker    drums     
##  5 Graham  Nash     guitar    
##  6 Jack    Bruce    bass      
##  7 Jack    Bruce    bass      
##  8 Jeff    Lynne    vocals    
##  9 Neil    Young    guitar    
## 10 Stephen Stills   guitar    
## 11 Tom     Petty    vocals

Other implementations

  • We can do the same joins as we have been using here wih the merge function, but the syntax is more confusing
    • I will always just use dplyr joins
  • The dplyr joins translate to sql joins as show
    • These are good to know as often I am exploring a sql database before pulling data into R. It happens.

  • These are the DBMSs (Database management systems) you can easily connect to from R
    • Once you have connected you can write dplyr code and it will be translated to sql and run in the DBMS
    • This is incredible useful. Its worthy of its own corse.
    • checkout vignette("dbplyr") to see more about dplyr’s database capabilities
    • We can use collect to pull the data into R when we are ready or data.frame will also pull it in.

SQL

  • Each dplyr join has an SQL equivalent, and when you apply a dplyr join to a SQL data source (instead of a data frame), dplyr automatically converts your join to its SQL equivalent and runs it in the database.

Base R

##        first      last               band instrument
## 1      Jimmy      Page       Led Zeppelin     Guitar
## 2       John    Bonham       Led Zeppelin       <NA>
## 3  John Paul     Jones       Led Zeppelin       <NA>
## 4     Robert     Plant       Led Zeppelin       <NA>
## 5     George  Harrison        The Beatles     Guitar
## 6       John    Lennon        The Beatles     Guitar
## 7       Paul McCartney        The Beatles       Bass
## 8      Ringo     Starr        The Beatles      Drums
## 9      Jimmy   Buffett  The Coral Reefers     Guitar
## 10   Charlie     Watts The Rolling Stones       <NA>
## 11     Keith  Richards The Rolling Stones     Guitar
## 12      Mick    Jagger The Rolling Stones     Vocals
## 13    Ronnie      Wood The Rolling Stones       <NA>
## # A tibble: 13 x 4
##    first     last      band               instrument
##    <chr>     <chr>     <chr>              <chr>     
##  1 John      Bonham    Led Zeppelin       <NA>      
##  2 John Paul Jones     Led Zeppelin       <NA>      
##  3 Jimmy     Page      Led Zeppelin       Guitar    
##  4 Robert    Plant     Led Zeppelin       <NA>      
##  5 George    Harrison  The Beatles        Guitar    
##  6 John      Lennon    The Beatles        Guitar    
##  7 Paul      McCartney The Beatles        Bass      
##  8 Ringo     Starr     The Beatles        Drums     
##  9 Jimmy     Buffett   The Coral Reefers  Guitar    
## 10 Mick      Jagger    The Rolling Stones Vocals    
## 11 Keith     Richards  The Rolling Stones Guitar    
## 12 Charlie   Watts     The Rolling Stones <NA>      
## 13 Ronnie    Wood      The Rolling Stones <NA>

   


Case Study


Lahman’s Baseball Database

  • library(Lahman)
  • Sean Lahman’s baseball database
  • 26 tables
  • A great example of real worl relational data

Universal keys?

  • In particular, do any variable names span all of the datasets?
## List of 28
##  $ AllstarFull        :Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of  1 variable:
##   ..$ var: chr [1:8] "playerID" "yearID" "gameNum" "gameID" ...
##  $ Appearances        :Classes 'tbl_df', 'tbl' and 'data.frame': 21 obs. of  1 variable:
##   ..$ var: chr [1:21] "yearID" "teamID" "lgID" "playerID" ...
##  $ AwardsManagers     :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "awardID" "yearID" "lgID" ...
##  $ AwardsPlayers      :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "awardID" "yearID" "lgID" ...
##  $ AwardsShareManagers:Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of  1 variable:
##   ..$ var: chr [1:7] "awardID" "yearID" "lgID" "playerID" ...
##  $ AwardsSharePlayers :Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of  1 variable:
##   ..$ var: chr [1:7] "awardID" "yearID" "lgID" "playerID" ...
##  $ Batting            :Classes 'tbl_df', 'tbl' and 'data.frame': 22 obs. of  1 variable:
##   ..$ var: chr [1:22] "playerID" "yearID" "stint" "teamID" ...
##  $ battingLabels      :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ BattingPost        :Classes 'tbl_df', 'tbl' and 'data.frame': 22 obs. of  1 variable:
##   ..$ var: chr [1:22] "yearID" "round" "playerID" "teamID" ...
##  $ CollegePlaying     :Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of  1 variable:
##   ..$ var: chr [1:3] "playerID" "schoolID" "yearID"
##  $ Fielding           :Classes 'tbl_df', 'tbl' and 'data.frame': 18 obs. of  1 variable:
##   ..$ var: chr [1:18] "playerID" "yearID" "stint" "teamID" ...
##  $ fieldingLabels     :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ FieldingOF         :Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  1 variable:
##   ..$ var: chr [1:6] "playerID" "yearID" "stint" "Glf" ...
##  $ FieldingPost       :Classes 'tbl_df', 'tbl' and 'data.frame': 17 obs. of  1 variable:
##   ..$ var: chr [1:17] "playerID" "yearID" "teamID" "lgID" ...
##  $ HallOfFame         :Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of  1 variable:
##   ..$ var: chr [1:9] "playerID" "yearID" "votedBy" "ballots" ...
##  $ LahmanData         :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "file" "class" "nobs" "nvar" ...
##  $ Managers           :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "playerID" "yearID" "teamID" "lgID" ...
##  $ ManagersHalf       :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "playerID" "yearID" "teamID" "lgID" ...
##  $ Master             :Classes 'tbl_df', 'tbl' and 'data.frame': 26 obs. of  1 variable:
##   ..$ var: chr [1:26] "playerID" "birthYear" "birthMonth" "birthDay" ...
##  $ Pitching           :Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  1 variable:
##   ..$ var: chr [1:30] "playerID" "yearID" "stint" "teamID" ...
##  $ pitchingLabels     :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
##   ..$ var: chr [1:2] "variable" "label"
##  $ PitchingPost       :Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of  1 variable:
##   ..$ var: chr [1:30] "playerID" "yearID" "round" "teamID" ...
##  $ Salaries           :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "yearID" "teamID" "lgID" "playerID" ...
##  $ Schools            :Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  1 variable:
##   ..$ var: chr [1:5] "schoolID" "name_full" "city" "state" ...
##  $ SeriesPost         :Classes 'tbl_df', 'tbl' and 'data.frame': 9 obs. of  1 variable:
##   ..$ var: chr [1:9] "yearID" "round" "teamIDwinner" "lgIDwinner" ...
##  $ Teams              :Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of  1 variable:
##   ..$ var: chr [1:48] "yearID" "lgID" "teamID" "franchID" ...
##  $ TeamsFranchises    :Classes 'tbl_df', 'tbl' and 'data.frame': 4 obs. of  1 variable:
##   ..$ var: chr [1:4] "franchID" "franchName" "active" "NAassoc"
##  $ TeamsHalf          :Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  1 variable:
##   ..$ var: chr [1:10] "yearID" "lgID" "teamID" "Half" ...
## # A tibble: 0 x 1
## # ... with 1 variable: var <chr>

Common keys?

## # A tibble: 59 x 2
##    var          n
##    <chr>    <int>
##  1 yearID      21
##  2 playerID    19
##  3 lgID        17
##  4 teamID      13
##  5 G           10
##  6 L            6
##  7 W            6
##  8 BB           5
##  9 CS           5
## 10 GS           5
## # ... with 49 more rows

playerID

##  [1] "AllstarFull"         "Appearances"         "AwardsManagers"     
##  [4] "AwardsPlayers"       "AwardsShareManagers" "AwardsSharePlayers" 
##  [7] "Batting"             "BattingPost"         "CollegePlaying"     
## [10] "Fielding"            "FieldingOF"          "FieldingPost"       
## [13] "HallOfFame"          "Managers"            "ManagersHalf"       
## [16] "Master"              "Pitching"            "PitchingPost"       
## [19] "Salaries"

Salaries

Missing salaries

  • How many players are completely missing salary information?
##    playerID nameFirst nameLast
## 1 aardsda01     David  Aardsma
## 2 aaronha01      Hank    Aaron
## 3 aaronto01    Tommie    Aaron
## 4  aasedo01       Don     Aase
## 5  abadan01      Andy     Abad
## 6  abadfe01  Fernando     Abad
##   yearID teamID lgID  playerID salary
## 1   1985    ATL   NL barkele01 870000
## 2   1985    ATL   NL bedrost01 550000
## 3   1985    ATL   NL benedbr01 545000
## 4   1985    ATL   NL  campri01 633333
## 5   1985    ATL   NL ceronri01 625000
## 6   1985    ATL   NL chambch01 800000
## # A tibble: 1 x 1
##       n
##   <int>
## 1 14468

Unpaid games?

  • Is it possible that these players somehow did not play (and hence did not earn a salary)?
##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b G_2b
## 1   1871    TRO   NA abercda01     1  1         1         1   0   0    0    0
## 2   1871    RC1   NA  addybo01    25 25        25        25   0   0    0   22
## 3   1871    CL1   NA allisar01    29 29        29        29   0   0    0    2
## 4   1871    WS3   NA allisdo01    27 27        27        27   0  27    0    0
## 5   1871    RC1   NA ansonca01    25 25        25        25   0   5    1    2
## 6   1871    FW1   NA armstbo01    12 12        12        12   0   0    0    0
##   G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1    0    1    0    0    0    0    0    0    0
## 2    0    3    0    0    0    0    0    0    0
## 3    0    0    0   29    0   29    0    0    0
## 4    0    0    0    0    0    0    0    0    0
## 5   20    0    1    0    0    1    0    0    0
## 6    0    0    0   11    1   12    0    0    0
## # A tibble: 1 x 1
##       n
##   <int>
## 1 14279

How many games?

  • Can you determine how many games each of these unsalaried players played?
## # A tibble: 14,468 x 2
##    playerID  total_games
##    <chr>           <int>
##  1 yastrca01        3308
##  2 aaronha01        3298
##  3 cobbty01         3034
##  4 musiast01        3026
##  5 mayswi01         2992
##  6 robinbr01        2896
##  7 kalinal01        2834
##  8 collied01        2825
##  9 robinfr02        2808
## 10 wagneho01        2797
## # ... with 14,458 more rows

How many at-bats?

Is it possible that the unsalaried players did not actually play in the games that they appeared in?

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1
##   IBB HBP SH SF GIDP
## 1  NA  NA NA NA    0
## 2  NA  NA NA NA    0
## 3  NA  NA NA NA    1
## 4  NA  NA NA NA    0
## 5  NA  NA NA NA    0
## 6  NA  NA NA NA    0
## # A tibble: 14,468 x 2
##    playerID  total_at_bat
##    <chr>            <int>
##  1 aaronha01        12364
##  2 yastrca01        11988
##  3 cobbty01         11435
##  4 musiast01        10972
##  5 mayswi01         10881
##  6 robinbr01        10654
##  7 wagneho01        10439
##  8 brocklo01        10332
##  9 ansonca01        10281
## 10 aparilu01        10230
## # ... with 14,458 more rows

Introducing the hall of fame

Hall of fame nominations

  • Let’s see how many players have been nominated for the Hall of Fame.
##    playerID yearID votedBy ballots needed votes inducted category needed_note
## 1  cobbty01   1936   BBWAA     226    170   222        Y   Player        <NA>
## 2  ruthba01   1936   BBWAA     226    170   215        Y   Player        <NA>
## 3 wagneho01   1936   BBWAA     226    170   215        Y   Player        <NA>
## 4 mathech01   1936   BBWAA     226    170   205        Y   Player        <NA>
## 5 johnswa01   1936   BBWAA     226    170   189        Y   Player        <NA>
## 6 lajoina01   1936   BBWAA     226    170   146        N   Player        <NA>
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1279
##    playerID nameFirst  nameLast
## 1  cobbty01        Ty      Cobb
## 2  ruthba01      Babe      Ruth
## 3 wagneho01     Honus    Wagner
## 4 mathech01   Christy Mathewson
## 5 johnswa01    Walter   Johnson
## 6 lajoina01       Nap    Lajoie

Hall of fame inductions

  • Let’s now see how many players were admitted to the hall of fame to examine how selective the voting process is.
##    playerID yearID    votedBy ballots needed votes inducted          category
## 1  cobbty01   1936      BBWAA     226    170   222        Y            Player
## 2  ruthba01   1936      BBWAA     226    170   215        Y            Player
## 3 wagneho01   1936      BBWAA     226    170   215        Y            Player
## 4 mathech01   1936      BBWAA     226    170   205        Y            Player
## 5 johnswa01   1936      BBWAA     226    170   189        Y            Player
## 6 bulkemo99   1937 Centennial      NA     NA    NA        Y Pioneer/Executive
##   needed_note
## 1        <NA>
## 2        <NA>
## 3        <NA>
## 4        <NA>
## 5        <NA>
## 6        <NA>
## # A tibble: 1 x 1
##       n
##   <int>
## 1   323
##    playerID nameFirst  nameLast
## 1  cobbty01        Ty      Cobb
## 2  ruthba01      Babe      Ruth
## 3 wagneho01     Honus    Wagner
## 4 mathech01   Christy Mathewson
## 5 johnswa01    Walter   Johnson
## 6 lajoina01       Nap    Lajoie

Retirement

  • One of the rules of the Hall of Fame is that players cannot be nominated until five years after they retire. Is this reflected in our data?
##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b G_2b
## 1   1871    TRO   NA abercda01     1  1         1         1   0   0    0    0
## 2   1871    RC1   NA  addybo01    25 25        25        25   0   0    0   22
## 3   1871    CL1   NA allisar01    29 29        29        29   0   0    0    2
## 4   1871    WS3   NA allisdo01    27 27        27        27   0  27    0    0
## 5   1871    RC1   NA ansonca01    25 25        25        25   0   5    1    2
## 6   1871    FW1   NA armstbo01    12 12        12        12   0   0    0    0
##   G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1    0    1    0    0    0    0    0    0    0
## 2    0    3    0    0    0    0    0    0    0
## 3    0    0    0   29    0   29    0    0    0
## 4    0    0    0    0    0    0    0    0    0
## 5   20    0    1    0    0    1    0    0    0
## 6    0    0    0   11    1   12    0    0    0
## # A tibble: 39 x 10
##    playerID last_year yearID votedBy ballots needed votes inducted category
##    <chr>        <int>  <int> <chr>     <int>  <int> <int> <fct>    <fct>   
##  1 cissebi~      1938   1937 BBWAA       201    151     1 N        Player  
##  2 cochrmi~      1937   1936 BBWAA       226    170    80 N        Player  
##  3 deandi01      1947   1945 BBWAA       247    186    17 N        Player  
##  4 deandi01      1947   1946 Final ~     263    198    45 N        Player  
##  5 deandi01      1947   1946 Nomina~     202     NA    40 N        Player  
##  6 deandi01      1947   1947 BBWAA       161    121    88 N        Player  
##  7 dickebi~      1946   1945 BBWAA       247    186    17 N        Player  
##  8 dickebi~      1946   1946 Nomina~     202     NA    40 N        Player  
##  9 dickebi~      1946   1946 Final ~     263    198    32 N        Player  
## 10 dimagjo~      1951   1945 BBWAA       247    186     1 N        Player  
## # ... with 29 more rows, and 1 more variable: needed_note <chr>