Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

groundhogs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/groundhogs.csv')

## Rows: 75 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): slug, shortname, name, city, region, country, source, current_pred...
## dbl  (4): id, latitude, longitude, predictions_count
## lgl  (2): is_groundhog, active
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

predictions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/predictions.csv')

## Rows: 1462 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): details
## dbl (2): id, year
## lgl (1): shadow
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
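The messages above suggest specifying the column types. A minimal sketch of that, assuming readr 2.0 or later (for literal data wrapped in `I()`): the inline CSV text and the name `predictions_demo` are stand-ins for illustration, not the real file, and the `details` column is left out.

```r
library(readr)

# Inline CSV standing in for predictions.csv (illustrative rows only)
csv_text <- I("id,year,shadow\n9,2009,FALSE\n43,2015,TRUE\n1,1889,NA\n")

# Declaring the column types up front means readr does not have to guess them,
# so the column specification message is not printed
predictions_demo <- read_csv(
  csv_text,
  col_types = cols(
    id     = col_double(),
    year   = col_double(),
    shadow = col_logical()
  )
)

predictions_demo
```

The same `col_types` argument should work with the URL in place of `csv_text`; `show_col_types = FALSE` is the shorter option if only the message needs to go away.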

2. Make data small

Describe the two datasets:

Data 1: groundhogs

Columns: id, name, region, is_groundhog
Rows: 10

Data 2: predictions

Columns: id, year, shadow
Rows: 10

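library(dplyr)  # provides %>%, select(), sample_n(), and the join functions

set.seed(1234)  # fix the random seed so the same rows are sampled on each run
groundhogs_small  <- groundhogs %>% select(id, name, region, is_groundhog) %>% sample_n(10)
predictions_small <- predictions %>% select(id, year, shadow) %>% sample_n(10)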

groundhogs_small

## # A tibble: 10 × 4
##       id name                 region               is_groundhog
##    <dbl> <chr>                <chr>                <lgl>       
##  1    28 Stonewall Jackson    New Jersey           TRUE        
##  2    22 Balzac Billy         Alberta              FALSE       
##  3     9 Gertie the Groundhog Illinois             TRUE        
##  4     5 Concord Charlie      West Virginia        FALSE       
##  5    38 Bowman Bill          Virginia             FALSE       
##  6    16 Shubenacadie Sam     Nova Scotia          TRUE        
##  7     4 Jimmy the Groundhog  Wisconsin            TRUE        
##  8    14 Manitoba Merv        Manitoba             FALSE       
##  9    56 Potomac Phil         District of Columbia FALSE       
## 10    62 Gordy the Groundhog  Wisconsin            TRUE

predictions_small

## # A tibble: 10 × 3
##       id  year shadow
##    <dbl> <dbl> <lgl> 
##  1     1  1889 NA    
##  2     9  2009 FALSE 
##  3    43  2015 TRUE  
##  4    23  2017 FALSE 
##  5    41  2020 FALSE 
##  6    26  2010 TRUE  
##  7     7  2011 FALSE 
##  8    11  2022 TRUE  
##  9    21  2006 TRUE  
## 10    10  1994 NA
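The dplyr documentation marks `sample_n()` as superseded by `slice_sample()`. A small sketch of the same select-then-sample step with the newer function, using a made-up table (`toy`, with invented names and an invented `extra` column) in place of the downloaded data:

```r
library(dplyr)

# Hypothetical stand-in for the full groundhogs table (not the real data)
toy <- tibble(
  id    = 1:75,
  name  = paste("Groundhog", 1:75),
  extra = rep(c("a", "b", "c"), 25)
)

set.seed(1234)
toy_small <- toy %>%
  select(id, name) %>%    # keep only the columns of interest
  slice_sample(n = 10)    # draw 10 rows at random, without replacement

toy_small
```

This is not guaranteed to pick the same rows as `sample_n()` would with the same seed, so switching functions may change the sampled tables shown above.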

3. inner_join

Describe the resulting data:

Columns: id, name, region, is_groundhog, year, shadow
Rows: 1

How is it different from the original two datasets?

* It has 1 row, compared to 10 in each original dataset, because id 9 is the only id that appears in both.
* All the columns from both datasets are present.

groundhogs_small %>% inner_join(predictions_small, by = c())

## Joining with `by = join_by(id)`

## # A tibble: 1 × 6
##      id name                 region   is_groundhog  year shadow
##   <dbl> <chr>                <chr>    <lgl>        <dbl> <lgl> 
## 1     9 Gertie the Groundhog Illinois TRUE          2009 FALSE
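To check why only one row survives, the sketch below rebuilds cut-down versions of the two small tables from the values printed above (the names `groundhogs_demo` and `predictions_demo` are mine, and only two columns of each table are kept) and names the join key explicitly:

```r
library(dplyr)

# id and is_groundhog copied from the printed groundhogs_small table
groundhogs_demo <- tibble(
  id = c(28, 22, 9, 5, 38, 16, 4, 14, 56, 62),
  is_groundhog = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# id and year copied from the printed predictions_small table
predictions_demo <- tibble(
  id   = c(1, 9, 43, 23, 41, 26, 7, 11, 21, 10),
  year = c(1889, 2009, 2015, 2017, 2020, 2010, 2011, 2022, 2006, 1994)
)

# The ids present in both tables decide which rows an inner join keeps
shared_ids <- intersect(groundhogs_demo$id, predictions_demo$id)
shared_ids

# by = "id" states the key explicitly instead of letting dplyr infer it
matched <- groundhogs_demo %>% inner_join(predictions_demo, by = "id")
matched
```

With `by = "id"` spelled out, dplyr no longer needs to print the "Joining with" message.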

4. left_join

Describe the resulting data:

Columns: id, name, region, is_groundhog, year, shadow
Rows: 10

How is it different from the original two datasets?

* It has all the columns from both datasets.
* It keeps all 10 rows of groundhogs_small. The year and shadow columns are NA in every row except the id 9 row, because id 9 is the only id that appears in both datasets.

groundhogs_small %>% left_join(predictions_small, by = c())

## Joining with `by = join_by(id)`

## # A tibble: 10 × 6
##       id name                 region               is_groundhog  year shadow
##    <dbl> <chr>                <chr>                <lgl>        <dbl> <lgl> 
##  1    28 Stonewall Jackson    New Jersey           TRUE            NA NA    
##  2    22 Balzac Billy         Alberta              FALSE           NA NA    
##  3     9 Gertie the Groundhog Illinois             TRUE          2009 FALSE 
##  4     5 Concord Charlie      West Virginia        FALSE           NA NA    
##  5    38 Bowman Bill          Virginia             FALSE           NA NA    
##  6    16 Shubenacadie Sam     Nova Scotia          TRUE            NA NA    
##  7     4 Jimmy the Groundhog  Wisconsin            TRUE            NA NA    
##  8    14 Manitoba Merv        Manitoba             FALSE           NA NA    
##  9    56 Potomac Phil         District of Columbia FALSE           NA NA    
## 10    62 Gordy the Groundhog  Wisconsin            TRUE            NA NA
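The result above has exactly 10 rows because each id appears at most once in predictions_small. In the full predictions table (1462 rows for 75 groundhogs) an id can appear many times, and then a left join returns one row per match. A toy sketch of that behaviour: the table names are mine, and the second prediction for id 9 is invented for illustration.

```r
library(dplyr)

groundhogs_toy <- tibble(
  id   = c(9, 28),
  name = c("Gertie the Groundhog", "Stonewall Jackson")
)

# Hypothetical predictions: the 2010 row for id 9 is made up
predictions_toy <- tibble(
  id     = c(9, 9),
  year   = c(2009, 2010),
  shadow = c(FALSE, TRUE)
)

joined <- groundhogs_toy %>% left_join(predictions_toy, by = "id")

# id 9 appears twice (one row per prediction); id 28 appears once with NA
joined
nrow(joined)
```

So a left join never drops rows from the first table, but it can return more rows than the first table has.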

5. right_join

Describe the resulting data:

Columns: id, name, region, is_groundhog, year, shadow
Rows: 10

How is it different from the original two datasets?

* It has all the columns from both datasets.
* It keeps all 10 rows of predictions_small. The name, region, and is_groundhog columns are NA in every row except the id 9 row, because id 9 is the only id common to both datasets.
* The shadow values for id 1 and id 10 are also NA, because they are already NA in predictions_small.

groundhogs_small %>% right_join(predictions_small, by = c())

## Joining with `by = join_by(id)`

## # A tibble: 10 × 6
##       id name                 region   is_groundhog  year shadow
##    <dbl> <chr>                <chr>    <lgl>        <dbl> <lgl> 
##  1     9 Gertie the Groundhog Illinois TRUE          2009 FALSE 
##  2     1 <NA>                 <NA>     NA            1889 NA    
##  3    43 <NA>                 <NA>     NA            2015 TRUE  
##  4    23 <NA>                 <NA>     NA            2017 FALSE 
##  5    41 <NA>                 <NA>     NA            2020 FALSE 
##  6    26 <NA>                 <NA>     NA            2010 TRUE  
##  7     7 <NA>                 <NA>     NA            2011 FALSE 
##  8    11 <NA>                 <NA>     NA            2022 TRUE  
##  9    21 <NA>                 <NA>     NA            2006 TRUE  
## 10    10 <NA>                 <NA>     NA            1994 NA
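A right join is a left join with the two tables swapped. The sketch below checks this on cut-down copies of the two small tables (names `groundhogs_demo` and `predictions_demo` are mine; values are copied from the printed output above). Only the column order and row order differ.

```r
library(dplyr)

groundhogs_demo <- tibble(
  id = c(28, 22, 9, 5, 38, 16, 4, 14, 56, 62),
  is_groundhog = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)
predictions_demo <- tibble(
  id   = c(1, 9, 43, 23, 41, 26, 7, 11, 21, 10),
  year = c(1889, 2009, 2015, 2017, 2020, 2010, 2011, 2022, 2006, 1994)
)

via_right <- groundhogs_demo %>% right_join(predictions_demo, by = "id")
via_left  <- predictions_demo %>% left_join(groundhogs_demo, by = "id")

names(via_right)  # columns of the first table come first
names(via_left)

# Put both results in the same column and row order before comparing
a <- via_right %>% arrange(id)
b <- via_left %>% select(all_of(names(via_right))) %>% arrange(id)
same_content <- isTRUE(all.equal(as.data.frame(a), as.data.frame(b)))
same_content
```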

6. full_join

Describe the resulting data:

Columns: id, name, region, is_groundhog, year, shadow
Rows: 19

How is it different from the original two datasets?

* It has all the columns from both datasets.
* It has 19 rows instead of 10, because every row from each dataset is kept. The total is 19 rather than 20 because id 9 appears in both datasets and is merged into a single row: 10 + 10 - 1 = 19.
* Rows that come only from predictions_small have NA for name, region, and is_groundhog, because groundhogs_small has no matching row to supply those values.
* Likewise, rows that come only from groundhogs_small have NA for year and shadow. The id 9 row is the exception: it is filled in from both datasets because the id matches.

groundhogs_small %>% full_join(predictions_small, by = c())

## Joining with `by = join_by(id)`

## # A tibble: 19 × 6
##       id name                 region               is_groundhog  year shadow
##    <dbl> <chr>                <chr>                <lgl>        <dbl> <lgl> 
##  1    28 Stonewall Jackson    New Jersey           TRUE            NA NA    
##  2    22 Balzac Billy         Alberta              FALSE           NA NA    
##  3     9 Gertie the Groundhog Illinois             TRUE          2009 FALSE 
##  4     5 Concord Charlie      West Virginia        FALSE           NA NA    
##  5    38 Bowman Bill          Virginia             FALSE           NA NA    
##  6    16 Shubenacadie Sam     Nova Scotia          TRUE            NA NA    
##  7     4 Jimmy the Groundhog  Wisconsin            TRUE            NA NA    
##  8    14 Manitoba Merv        Manitoba             FALSE           NA NA    
##  9    56 Potomac Phil         District of Columbia FALSE           NA NA    
## 10    62 Gordy the Groundhog  Wisconsin            TRUE            NA NA    
## 11     1 <NA>                 <NA>                 NA            1889 NA    
## 12    43 <NA>                 <NA>                 NA            2015 TRUE  
## 13    23 <NA>                 <NA>                 NA            2017 FALSE 
## 14    41 <NA>                 <NA>                 NA            2020 FALSE 
## 15    26 <NA>                 <NA>                 NA            2010 TRUE  
## 16     7 <NA>                 <NA>                 NA            2011 FALSE 
## 17    11 <NA>                 <NA>                 NA            2022 TRUE  
## 18    21 <NA>                 <NA>                 NA            2006 TRUE  
## 19    10 <NA>                 <NA>                 NA            1994 NA
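The row count can be checked with plain set arithmetic on the ids printed above. This shortcut assumes, as is the case here, that no id repeats within either table. The variable names are mine.

```r
# ids copied from the printed groundhogs_small and predictions_small tables
groundhog_ids  <- c(28, 22, 9, 5, 38, 16, 4, 14, 56, 62)
prediction_ids <- c(1, 9, 43, 23, 41, 26, 7, 11, 21, 10)

n_shared <- length(intersect(groundhog_ids, prediction_ids))  # ids in both tables

# Rows in the full join = rows in each table minus the rows counted twice
n_full <- length(groundhog_ids) + length(prediction_ids) - n_shared
n_full  # 19

# Equivalent view: one row per distinct id across both tables
length(union(groundhog_ids, prediction_ids))  # 19
```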

7. semi_join

Describe the resulting data:

Columns: id, name, region, is_groundhog
Rows: 1

How is it different from the original two datasets?

* It has only the columns from groundhogs_small.
* Its only row is id 9, the same row that inner_join returns. The difference is that inner_join also adds year and shadow, whereas semi_join keeps only the columns of the first dataset and uses the second dataset just to decide which rows to keep.

groundhogs_small %>% semi_join(predictions_small, by = c())

## Joining with `by = join_by(id)`

## # A tibble: 1 × 4
##      id name                 region   is_groundhog
##   <dbl> <chr>                <chr>    <lgl>       
## 1     9 Gertie the Groundhog Illinois TRUE
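Because semi_join only filters the first table, it can be written as a `filter()` on the ids of the second table. A sketch on cut-down copies of the two small tables (the names `groundhogs_demo` and `predictions_demo` are mine; values are copied from the printed output above):

```r
library(dplyr)

groundhogs_demo <- tibble(
  id = c(28, 22, 9, 5, 38, 16, 4, 14, 56, 62),
  is_groundhog = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)
predictions_demo <- tibble(
  id   = c(1, 9, 43, 23, 41, 26, 7, 11, 21, 10),
  year = c(1889, 2009, 2015, 2017, 2020, 2010, 2011, 2022, 2006, 1994)
)

# Keep groundhogs that have at least one match; no columns are added
via_semi <- groundhogs_demo %>% semi_join(predictions_demo, by = "id")

# The same rows, selected with an explicit membership test
via_filter <- groundhogs_demo %>% filter(id %in% predictions_demo$id)

via_semi
via_filter
```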

8. anti_join

Describe the resulting data:

Columns: id, name, region, is_groundhog
Rows: 9

How is it different from the original two datasets?

* It has only the columns from groundhogs_small.
* It is groundhogs_small with the id 9 row removed. anti_join drops the rows that have a match in the second dataset and keeps the rows that do not, so it shows which groundhogs have no prediction in predictions_small.

groundhogs_small %>% anti_join(predictions_small, by = c())

## Joining with `by = join_by(id)`

## # A tibble: 9 × 4
##      id name                region               is_groundhog
##   <dbl> <chr>               <chr>                <lgl>       
## 1    28 Stonewall Jackson   New Jersey           TRUE        
## 2    22 Balzac Billy        Alberta              FALSE       
## 3     5 Concord Charlie     West Virginia        FALSE       
## 4    38 Bowman Bill         Virginia             FALSE       
## 5    16 Shubenacadie Sam    Nova Scotia          TRUE        
## 6     4 Jimmy the Groundhog Wisconsin            TRUE        
## 7    14 Manitoba Merv       Manitoba             FALSE       
## 8    56 Potomac Phil        District of Columbia FALSE       
## 9    62 Gordy the Groundhog Wisconsin            TRUE
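semi_join and anti_join split the first table into two non-overlapping parts: rows with a match and rows without. A sketch of that check on cut-down copies of the two small tables (the names `groundhogs_demo` and `predictions_demo` are mine; values are copied from the printed output above):

```r
library(dplyr)

groundhogs_demo <- tibble(
  id = c(28, 22, 9, 5, 38, 16, 4, 14, 56, 62),
  is_groundhog = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)
predictions_demo <- tibble(
  id   = c(1, 9, 43, 23, 41, 26, 7, 11, 21, 10),
  year = c(1889, 2009, 2015, 2017, 2020, 2010, 2011, 2022, 2006, 1994)
)

matched   <- groundhogs_demo %>% semi_join(predictions_demo, by = "id")
unmatched <- groundhogs_demo %>% anti_join(predictions_demo, by = "id")

# Every groundhog lands in exactly one of the two results
nrow(matched) + nrow(unmatched) == nrow(groundhogs_demo)

# No id is in both results
intersect(matched$id, unmatched$id)
```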

Alden Dimick

2024-10-24
