Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from TidyTuesday Project.

taylor_album_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-10-17/taylor_album_songs.csv')

## Rows: 194 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): album_name, track_name, artist, featuring, key_name, mode_name, k...
## dbl  (14): track_number, danceability, energy, key, loudness, mode, speechin...
## lgl   (4): ep, bonus_track, explicit, lyrics
## date  (4): album_release, promotional_release, single_release, track_release
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

taylor_all_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-10-17/taylor_all_songs.csv')

## Rows: 274 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): album_name, track_name, artist, featuring, key_name, mode_name, k...
## dbl  (14): track_number, danceability, energy, key, loudness, mode, speechin...
## lgl   (4): ep, bonus_track, explicit, lyrics
## date  (4): album_release, promotional_release, single_release, track_release
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data1: Taylor Swift Album songs

Columns: album_name, single_release, track_name
Rows: 10

Data 2: Taylor Swift All songs

Columns: album_name, single_release, bonus_track
Rows: 10

set.seed(1234)
taylor_album_songs_small <- taylor_album_songs %>% select(album_name, single_release, track_name) %>% sample_n(10)
taylor_all_songs_small <- taylor_all_songs %>% select(album_name, single_release, bonus_track) %>% sample_n(10)
taylor_album_songs_small

## # A tibble: 10 × 3
##    album_name                  single_release track_name                        
##    <chr>                       <date>         <chr>                             
##  1 Fearless (Taylor's Version) NA             Change (Taylor's Version)         
##  2 Red (Taylor's Version)      NA             Better Man (Taylor's Version) [Fr…
##  3 folklore                    NA             epiphany                          
##  4 1989                        NA             Clean                             
##  5 reputation                  NA             So It Goes...                     
##  6 Lover                       NA             Daylight                          
##  7 Lover                       2019-06-14     You Need To Calm Down             
##  8 evermore                    NA             long story short                  
##  9 folklore                    NA             seven                             
## 10 Lover                       NA             False God

taylor_all_songs_small

## # A tibble: 10 × 3
##    album_name                  single_release bonus_track
##    <chr>                       <date>         <lgl>      
##  1 Red                         NA             TRUE       
##  2 Red (Taylor's Version)      NA             FALSE      
##  3 Midnights                   NA             FALSE      
##  4 Red                         NA             FALSE      
##  5 Red                         2013-06-21     FALSE      
##  6 folklore                    NA             FALSE      
##  7 Speak Now                   2011-07-18     FALSE      
##  8 Taylor Swift                NA             FALSE      
##  9 reputation                  NA             FALSE      
## 10 Fearless (Taylor's Version) NA             FALSE

3. inner_join

Describe the resulting data:

Columns: album_name, single_release, track_name, bonus_track
Rows: 7

How is it different from the original two datasets? * 7 Rows compared to 10 rows in the original dataset * all columns from the two datasets

*many-to-many relationship in this set

taylor_album_songs_small %>% inner_join(taylor_all_songs_small)

## Joining with `by = join_by(album_name, single_release)`

## # A tibble: 5 × 4
##   album_name                  single_release track_name              bonus_track
##   <chr>                       <date>         <chr>                   <lgl>      
## 1 Fearless (Taylor's Version) NA             Change (Taylor's Versi… FALSE      
## 2 Red (Taylor's Version)      NA             Better Man (Taylor's V… FALSE      
## 3 folklore                    NA             epiphany                FALSE      
## 4 reputation                  NA             So It Goes...           FALSE      
## 5 folklore                    NA             seven                   FALSE

4. left_join

Describe the resulting data:

Columns: album_name, single_release, track_name, bonus_track
Rows:12

How is it different from the original two datasets? * keeps all observations in album_name * preserves the original observations in album_name even when there isn’t a match * still has many-to-many relationship

left_join(taylor_album_songs_small, taylor_all_songs_small, by = "album_name")

## # A tibble: 10 × 5
##    album_name           single_release.x track_name single_release.y bonus_track
##    <chr>                <date>           <chr>      <date>           <lgl>      
##  1 Fearless (Taylor's … NA               Change (T… NA               FALSE      
##  2 Red (Taylor's Versi… NA               Better Ma… NA               FALSE      
##  3 folklore             NA               epiphany   NA               FALSE      
##  4 1989                 NA               Clean      NA               NA         
##  5 reputation           NA               So It Goe… NA               FALSE      
##  6 Lover                NA               Daylight   NA               NA         
##  7 Lover                2019-06-14       You Need … NA               NA         
##  8 evermore             NA               long stor… NA               NA         
##  9 folklore             NA               seven      NA               FALSE      
## 10 Lover                NA               False God  NA               NA

5. right_join

Describe the resulting data:

Columns: album_name, single_release, track_name, bonus_track
Rows:13

How is it different from the original two datasets? * joined by album_name and single_release * still has many-to-many relationship

right_join(taylor_album_songs_small, taylor_all_songs_small)

## Joining with `by = join_by(album_name, single_release)`

## # A tibble: 11 × 4
##    album_name                  single_release track_name             bonus_track
##    <chr>                       <date>         <chr>                  <lgl>      
##  1 Fearless (Taylor's Version) NA             Change (Taylor's Vers… FALSE      
##  2 Red (Taylor's Version)      NA             Better Man (Taylor's … FALSE      
##  3 folklore                    NA             epiphany               FALSE      
##  4 reputation                  NA             So It Goes...          FALSE      
##  5 folklore                    NA             seven                  FALSE      
##  6 Red                         NA             <NA>                   TRUE       
##  7 Midnights                   NA             <NA>                   FALSE      
##  8 Red                         NA             <NA>                   FALSE      
##  9 Red                         2013-06-21     <NA>                   FALSE      
## 10 Speak Now                   2011-07-18     <NA>                   FALSE      
## 11 Taylor Swift                NA             <NA>                   FALSE

6. full_join

Describe the resulting data:

Columns: album_name, single_release, track_name, bonus_track
Rows: 18

How is it different from the original two datasets? * keeps all observations from both data sets * still many-to-many relationship

full_join(taylor_album_songs_small, taylor_all_songs_small)

## Joining with `by = join_by(album_name, single_release)`

## # A tibble: 16 × 4
##    album_name                  single_release track_name             bonus_track
##    <chr>                       <date>         <chr>                  <lgl>      
##  1 Fearless (Taylor's Version) NA             Change (Taylor's Vers… FALSE      
##  2 Red (Taylor's Version)      NA             Better Man (Taylor's … FALSE      
##  3 folklore                    NA             epiphany               FALSE      
##  4 1989                        NA             Clean                  NA         
##  5 reputation                  NA             So It Goes...          FALSE      
##  6 Lover                       NA             Daylight               NA         
##  7 Lover                       2019-06-14     You Need To Calm Down  NA         
##  8 evermore                    NA             long story short       NA         
##  9 folklore                    NA             seven                  FALSE      
## 10 Lover                       NA             False God              NA         
## 11 Red                         NA             <NA>                   TRUE       
## 12 Midnights                   NA             <NA>                   FALSE      
## 13 Red                         NA             <NA>                   FALSE      
## 14 Red                         2013-06-21     <NA>                   FALSE      
## 15 Speak Now                   2011-07-18     <NA>                   FALSE      
## 16 Taylor Swift                NA             <NA>                   FALSE

7. semi_join

Describe the resulting data:

Columns: album_name, single_release, track_name
Rows: 5

How is it different from the original two datasets? * only five rows compared to ten in the original * smaller tibble 5X3, and kept all observations in taylor_album_songs_small that have a match in taylor_all_songs_small

semi_join(taylor_album_songs_small, taylor_all_songs_small)

## Joining with `by = join_by(album_name, single_release)`

## # A tibble: 5 × 3
##   album_name                  single_release track_name                         
##   <chr>                       <date>         <chr>                              
## 1 Fearless (Taylor's Version) NA             Change (Taylor's Version)          
## 2 Red (Taylor's Version)      NA             Better Man (Taylor's Version) [Fro…
## 3 folklore                    NA             epiphany                           
## 4 reputation                  NA             So It Goes...                      
## 5 folklore                    NA             seven

8. anti_join

Describe the resulting data:

Columns: album_name, single_release, track_name
Rows: 5

How is it different from the original two datasets? * dropped all observations in taylor_album_songs_small that have a match in taylor_all_songs_small * only five rows compared to ten in original

anti_join(taylor_album_songs_small, taylor_all_songs_small)

## Joining with `by = join_by(album_name, single_release)`

## # A tibble: 5 × 3
##   album_name single_release track_name           
##   <chr>      <date>         <chr>                
## 1 1989       NA             Clean                
## 2 Lover      NA             Daylight             
## 3 Lover      2019-06-14     You Need To Calm Down
## 4 evermore   NA             long story short     
## 5 Lover      NA             False God

Week 9: Apply it to your data 8

Alyssa D’Alessio

2024-10-07

1. Import your data

2. Make data small

3. inner_join

4. left_join

5. right_join

6. full_join

7. semi_join

8. anti_join