Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

tt_datasets <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_datasets.csv')

## Rows: 644 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): dataset_name
## dbl (4): year, week, variables, observations
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tt_summary <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-02/tt_summary.csv')

## Rows: 324 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): title, source_title, article_title
## dbl  (2): year, week
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: tt_summary

Columns: year, week, title
Rows: 10

Data 2: tt_datasets

Columns: year, week, variables
Rows: 10

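library(dplyr)  # provides %>%, select(), sample_n() and the join verbs
# Keep the two join keys plus one extra column, then draw 10 random rows from each table
tt_summary_small <- tt_summary %>% select(year, week, title) %>% sample_n(10)
tt_datasets_small <- tt_datasets %>% select(year, week, variables) %>% sample_n(10)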

tt_summary_small

## # A tibble: 10 × 3
##     year  week title                        
##    <dbl> <dbl> <chr>                        
##  1  2022    25 Juneteenth                   
##  2  2023    21 Central Park Squirrels       
##  3  2021    35 Lemurs                       
##  4  2020     4 Song Genres                  
##  5  2024    15 2023 & 2024 US Solar Eclipses
##  6  2019    23 Ramen Ratings                
##  7  2020     9 Measles Vaccination          
##  8  2023    34 Refugees                     
##  9  2018     3 Global Mortality             
## 10  2020    12 The Office

tt_datasets_small

## # A tibble: 10 × 3
##     year  week variables
##    <dbl> <dbl>     <dbl>
##  1  2024    24         5
##  2  2020    19         4
##  3  2022    43        24
##  4  2021    37         7
##  5  2020    27         4
##  6  2019    29        21
##  7  2019    18         1
##  8  2019    19         9
##  9  2020    14         6
## 10  2022    10        24
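
Because sample_n() draws rows at random, the two small tables change on every run, and so does every join result that follows. The sketch below shows one way to make the draw repeatable. It uses a made-up toy_summary table in place of tt_summary, and the seed value 123 is arbitrary; slice_sample() is the newer dplyr verb that supersedes sample_n().

```r
library(dplyr)

# Hypothetical stand-in for tt_summary: 20 rows keyed by year and week
toy_summary <- tibble(
  year  = rep(2018:2021, each = 5),
  week  = rep(1:5, times = 4),
  title = paste("Dataset", 1:20)
)

# Setting the same seed before each draw makes the random sample repeatable
set.seed(123)
draw_1 <- toy_summary %>% slice_sample(n = 10)

set.seed(123)
draw_2 <- toy_summary %>% slice_sample(n = 10)

# Both draws contain exactly the same rows in the same order
identical(draw_1, draw_2)
```

Calling set.seed() once at the top of the report should be enough to keep the written descriptions in step with the printed output.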

3. inner_join

Describe the resulting data:

Columns: year, week, title, variables
Rows: 0

How is it different from the original two datasets?
* 0 rows compared to 10 rows in each of the original datasets, because the two random samples share no (year, week) pair
* All columns from the two datasets

tt_summary_small %>% inner_join(tt_datasets_small, by = c("year", "week"))

## # A tibble: 0 × 4
## # ℹ 4 variables: year <dbl>, week <dbl>, title <chr>, variables <dbl>
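
The empty result comes from the sampling rather than from the join itself: an inner join keeps only rows whose key occurs in both tables. The sketch below counts shared keys before joining and shows a non-empty inner join. The tables toy_x and toy_y and all their values are hypothetical, not TidyTuesday data.

```r
library(dplyr)

# Hypothetical toy tables; two of the (year, week) keys occur in both
toy_x <- tribble(
  ~year, ~week, ~title,
   2020,     1, "A",
   2020,     2, "B",
   2021,     1, "C"
)
toy_y <- tribble(
  ~year, ~week, ~variables,
   2020,     2,          4,
   2021,     1,          7,
   2021,     5,          9
)

# Count the key pairs that are present in both tables
shared_keys <- intersect(
  toy_x %>% select(year, week),
  toy_y %>% select(year, week)
)
nrow(shared_keys)

# inner_join() keeps only the rows whose key is shared
toy_inner <- toy_x %>% inner_join(toy_y, by = c("year", "week"))
toy_inner
```

If the shared-key count is zero, as with the two samples here, every join in the following sections will show unmatched rows only.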

4. left_join

Describe the resulting data:

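Columns: year, week, title, variables
Rows: 10

How is it different from the original two datasets?
* All 10 rows of tt_summary_small are kept
* The variables column from tt_datasets_small is added, but it is NA in every row because no (year, week) pair matched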

tt_summary_small %>% left_join(tt_datasets_small, by = c("year", "week"))

## # A tibble: 10 × 4
##     year  week title                         variables
##    <dbl> <dbl> <chr>                             <dbl>
##  1  2022    25 Juneteenth                           NA
##  2  2023    21 Central Park Squirrels               NA
##  3  2021    35 Lemurs                               NA
##  4  2020     4 Song Genres                          NA
##  5  2024    15 2023 & 2024 US Solar Eclipses        NA
##  6  2019    23 Ramen Ratings                        NA
##  7  2020     9 Measles Vaccination                  NA
##  8  2023    34 Refugees                             NA
##  9  2018     3 Global Mortality                     NA
## 10  2020    12 The Office                           NA

5. right_join

Describe the resulting data:

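Columns: year, week, title, variables
Rows: 10

How is it different from the original two datasets?
* All 10 rows of tt_datasets_small are kept
* The title column from tt_summary_small is added, but it is NA in every row because no (year, week) pair matched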

tt_summary_small %>% right_join(tt_datasets_small, by = c("year", "week"))

## # A tibble: 10 × 4
##     year  week title variables
##    <dbl> <dbl> <chr>     <dbl>
##  1  2024    24 <NA>          5
##  2  2020    19 <NA>          4
##  3  2022    43 <NA>         24
##  4  2021    37 <NA>          7
##  5  2020    27 <NA>          4
##  6  2019    29 <NA>         21
##  7  2019    18 <NA>          1
##  8  2019    19 <NA>          9
##  9  2020    14 <NA>          6
## 10  2022    10 <NA>         24

6. full_join

Describe the resulting data:

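Columns: year, week, title, variables
Rows: 20

How is it different from the original two datasets?
* There are 20 rows instead of 10: every row from both datasets is kept, and because no keys matched, the column coming from the other dataset is NA in each row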

tt_summary_small %>% full_join(tt_datasets_small, by = c("year", "week"))

## # A tibble: 20 × 4
##     year  week title                         variables
##    <dbl> <dbl> <chr>                             <dbl>
##  1  2022    25 Juneteenth                           NA
##  2  2023    21 Central Park Squirrels               NA
##  3  2021    35 Lemurs                               NA
##  4  2020     4 Song Genres                          NA
##  5  2024    15 2023 & 2024 US Solar Eclipses        NA
##  6  2019    23 Ramen Ratings                        NA
##  7  2020     9 Measles Vaccination                  NA
##  8  2023    34 Refugees                             NA
##  9  2018     3 Global Mortality                     NA
## 10  2020    12 The Office                           NA
## 11  2024    24 <NA>                                  5
## 12  2020    19 <NA>                                  4
## 13  2022    43 <NA>                                 24
## 14  2021    37 <NA>                                  7
## 15  2020    27 <NA>                                  4
## 16  2019    29 <NA>                                 21
## 17  2019    18 <NA>                                  1
## 18  2019    19 <NA>                                  9
## 19  2020    14 <NA>                                  6
## 20  2022    10 <NA>                                 24
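
When each key appears at most once per table, the number of rows in a full join is nrow(x) + nrow(y) minus the number of matched keys, which here is 10 + 10 - 0 = 20. The sketch below checks that arithmetic on hypothetical toy tables (all names and values are made up). It assumes unique keys; if a table holds several rows for the same year and week, one match can produce more than one row and the formula no longer applies.

```r
library(dplyr)

# Hypothetical toy tables with unique (year, week) keys; two keys are shared
toy_x <- tribble(
  ~year, ~week, ~title,
   2020,     1, "A",
   2020,     2, "B",
   2021,     1, "C"
)
toy_y <- tribble(
  ~year, ~week, ~variables,
   2020,     2,          4,
   2021,     1,          7,
   2021,     5,          9
)

# Number of keys that match between the two tables
n_matched <- nrow(inner_join(toy_x, toy_y, by = c("year", "week")))

# full_join() keeps matched rows once, plus the unmatched rows from both sides
toy_full <- toy_x %>% full_join(toy_y, by = c("year", "week"))

# Check the row-count arithmetic: 3 + 3 - 2
nrow(toy_full) == nrow(toy_x) + nrow(toy_y) - n_matched
```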

7. semi_join

Describe the resulting data:

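Columns: year, week, title
Rows: 0

How is it different from the original two datasets?
* 0 rows compared to 10 rows in tt_summary_small, because none of its (year, week) pairs appear in tt_datasets_small
* Only the columns of tt_summary_small are returned; variables is not added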

tt_summary_small %>% semi_join(tt_datasets_small, by = c("year", "week"))

## # A tibble: 0 × 3
## # ℹ 3 variables: year <dbl>, week <dbl>, title <chr>

8. anti_join

Describe the resulting data:

Columns: year, week, title
Rows: 10

How is it different from the original two datasets?
* All 10 rows of tt_summary_small are returned, because none of them has a match in tt_datasets_small
* Only the columns of tt_summary_small are returned; variables is not added

tt_summary_small %>% anti_join(tt_datasets_small, by = c("year", "week"))

## # A tibble: 10 × 3
##     year  week title                        
##    <dbl> <dbl> <chr>                        
##  1  2022    25 Juneteenth                   
##  2  2023    21 Central Park Squirrels       
##  3  2021    35 Lemurs                       
##  4  2020     4 Song Genres                  
##  5  2024    15 2023 & 2024 US Solar Eclipses
##  6  2019    23 Ramen Ratings                
##  7  2020     9 Measles Vaccination          
##  8  2023    34 Refugees                     
##  9  2018     3 Global Mortality             
## 10  2020    12 The Office
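
semi_join() and anti_join() are filtering joins: they return complementary subsets of the left table and never add columns from the right one. That is why the semi join has 0 rows and the anti join has all 10. A sketch on hypothetical toy tables (names and values are made up) shows the same split with partial overlap:

```r
library(dplyr)

# Hypothetical toy tables; two of the (year, week) keys occur in both
toy_x <- tribble(
  ~year, ~week, ~title,
   2020,     1, "A",
   2020,     2, "B",
   2021,     1, "C"
)
toy_y <- tribble(
  ~year, ~week, ~variables,
   2020,     2,          4,
   2021,     1,          7,
   2021,     5,          9
)

# Rows of toy_x that have a match in toy_y
toy_semi <- toy_x %>% semi_join(toy_y, by = c("year", "week"))

# Rows of toy_x that have no match in toy_y
toy_anti <- toy_x %>% anti_join(toy_y, by = c("year", "week"))

# Together the two results account for every row of toy_x
nrow(toy_semi) + nrow(toy_anti) == nrow(toy_x)

# Neither result gains the variables column
names(toy_semi)
```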

Ryan Winschel

2022-10-05
