Import two related datasets from the TidyTuesday project.
groundhogs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/groundhogs.csv')
## Rows: 75 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): slug, shortname, name, city, region, country, source, current_pred...
## dbl (4): id, latitude, longitude, predictions_count
## lgl (2): is_groundhog, active
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
predictions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-30/predictions.csv')
## Rows: 1462 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): details
## dbl (2): id, year
## lgl (1): shadow
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
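The messages above suggest either specifying the column types or setting `show_col_types = FALSE`. Here is a minimal sketch of both options, assuming readr 2.0 or later. To keep it runnable offline it reads two rows copied from the printed predictions sample as literal CSV text instead of the remote file; `csv_text` and `preds_demo` are names made up for this illustration.

```r
library(readr)

# Two rows copied from the printed predictions sample, wrapped in I()
# so readr treats the string as literal data rather than a file path.
csv_text <- I("id,year,shadow\n9,2009,FALSE\n43,2015,TRUE\n")

# col_types pins the column types instead of relying on guessing;
# show_col_types = FALSE suppresses the column-specification message.
preds_demo <- read_csv(
  csv_text,
  col_types = cols(
    id = col_double(),
    year = col_double(),
    shadow = col_logical()
  ),
  show_col_types = FALSE
)

preds_demo
```

The same two arguments could be passed to the `read_csv()` calls on the full files.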
Describe the two datasets:
Data 1: groundhogs (75 rows, 17 columns)
Data 2: predictions (1462 rows, 4 columns)
library(dplyr)
set.seed(1234)
groundhogs_small <- groundhogs %>% select(id, name, region, is_groundhog) %>% sample_n(10)
predictions_small <- predictions %>% select(id, year, shadow) %>% sample_n(10)
groundhogs_small
## # A tibble: 10 × 4
## id name region is_groundhog
## <dbl> <chr> <chr> <lgl>
## 1 28 Stonewall Jackson New Jersey TRUE
## 2 22 Balzac Billy Alberta FALSE
## 3 9 Gertie the Groundhog Illinois TRUE
## 4 5 Concord Charlie West Virginia FALSE
## 5 38 Bowman Bill Virginia FALSE
## 6 16 Shubenacadie Sam Nova Scotia TRUE
## 7 4 Jimmy the Groundhog Wisconsin TRUE
## 8 14 Manitoba Merv Manitoba FALSE
## 9 56 Potomac Phil District of Columbia FALSE
## 10 62 Gordy the Groundhog Wisconsin TRUE
predictions_small
## # A tibble: 10 × 3
## id year shadow
## <dbl> <dbl> <lgl>
## 1 1 1889 NA
## 2 9 2009 FALSE
## 3 43 2015 TRUE
## 4 23 2017 FALSE
## 5 41 2020 FALSE
## 6 26 2010 TRUE
## 7 7 2011 FALSE
## 8 11 2022 TRUE
## 9 21 2006 TRUE
## 10 10 1994 NA
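Before joining, it can help to check which `id` values the two samples share, since that decides how many rows each join returns. This is a small sketch in base R using the `id` columns copied from the two printouts above; the vector names are made up for the illustration.

```r
# id values copied from the printed groundhogs_small and predictions_small
groundhog_ids  <- c(28, 22, 9, 5, 38, 16, 4, 14, 56, 62)
prediction_ids <- c(1, 9, 43, 23, 41, 26, 7, 11, 21, 10)

# ids present in both samples (these rows can be matched by a join)
shared_ids <- intersect(groundhog_ids, prediction_ids)

# ids present in at least one of the samples
all_ids <- union(groundhog_ids, prediction_ids)

shared_ids        # 9
length(all_ids)   # 19
```

Only one id is shared, and there are 19 distinct ids overall (10 + 10 - 1).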
Describe the resulting data:
How is it different from the original two datasets?
* 1 row, compared to 10 in each of the original datasets, because id 9 is the only id that appears in both
* All the columns from both datasets are present
groundhogs_small %>% inner_join(predictions_small, by = c())
## Joining with `by = join_by(id)`
## # A tibble: 1 × 6
## id name region is_groundhog year shadow
## <dbl> <chr> <chr> <lgl> <dbl> <lgl>
## 1 9 Gertie the Groundhog Illinois TRUE 2009 FALSE
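The join above leaves the key unspecified, so dplyr picks the shared column name and prints the "Joining with" message. Here is a sketch of the same kind of join with the key named explicitly, on a few rows copied from the printed samples. The names `groundhogs_demo`, `predictions_demo`, and `inner_demo` are made up for this illustration.

```r
library(dplyr)

# A few rows copied from the printed samples above
groundhogs_demo <- tibble(
  id = c(28, 9, 5),
  name = c("Stonewall Jackson", "Gertie the Groundhog", "Concord Charlie")
)
predictions_demo <- tibble(
  id = c(1, 9, 43),
  year = c(1889, 2009, 2015),
  shadow = c(NA, FALSE, TRUE)
)

# Naming the key with by = "id" makes the join explicit
# and avoids the "Joining with ..." message.
inner_demo <- groundhogs_demo %>%
  inner_join(predictions_demo, by = "id")

inner_demo
```

Naming the key also guards against an accidental join on some other column that happens to have the same name in both tables.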
Describe the resulting data:
How is it different from the original two datasets?
* Has all columns from both datasets
* Keeps all 10 rows of groundhogs_small. The year and shadow columns are NA in every row except id 9, the only id that appears in both datasets
groundhogs_small %>% left_join(predictions_small, by = c())
## Joining with `by = join_by(id)`
## # A tibble: 10 × 6
## id name region is_groundhog year shadow
## <dbl> <chr> <chr> <lgl> <dbl> <lgl>
## 1 28 Stonewall Jackson New Jersey TRUE NA NA
## 2 22 Balzac Billy Alberta FALSE NA NA
## 3 9 Gertie the Groundhog Illinois TRUE 2009 FALSE
## 4 5 Concord Charlie West Virginia FALSE NA NA
## 5 38 Bowman Bill Virginia FALSE NA NA
## 6 16 Shubenacadie Sam Nova Scotia TRUE NA NA
## 7 4 Jimmy the Groundhog Wisconsin TRUE NA NA
## 8 14 Manitoba Merv Manitoba FALSE NA NA
## 9 56 Potomac Phil District of Columbia FALSE NA NA
## 10 62 Gordy the Groundhog Wisconsin TRUE NA NA
Describe the resulting data:
How is it different from the original two datasets?
* Has all columns from both datasets
* Keeps all 10 rows of predictions_small. The name, region, and is_groundhog columns are NA in every row except id 9, again the only id common to both datasets. The shadow values for id 1 and id 10 are also NA, but that is because they are already NA in predictions_small
groundhogs_small %>% right_join(predictions_small, by = c())
## Joining with `by = join_by(id)`
## # A tibble: 10 × 6
## id name region is_groundhog year shadow
## <dbl> <chr> <chr> <lgl> <dbl> <lgl>
## 1 9 Gertie the Groundhog Illinois TRUE 2009 FALSE
## 2 1 <NA> <NA> NA 1889 NA
## 3 43 <NA> <NA> NA 2015 TRUE
## 4 23 <NA> <NA> NA 2017 FALSE
## 5 41 <NA> <NA> NA 2020 FALSE
## 6 26 <NA> <NA> NA 2010 TRUE
## 7 7 <NA> <NA> NA 2011 FALSE
## 8 11 <NA> <NA> NA 2022 TRUE
## 9 21 <NA> <NA> NA 2006 TRUE
## 10 10 <NA> <NA> NA 1994 NA
Describe the resulting data:
How is it different from the original two datasets?
* Has all columns from both datasets
* 19 rows instead of 10, because every row from each dataset is kept and combined into one dataset. There are 19 rows rather than 20 because id 9 appears in both datasets, so it takes up one row instead of two separate ones
* Rows that come only from predictions_small have no name, region, or is_groundhog, so those columns are filled with NA. Likewise, year and shadow are NA for rows that come only from groundhogs_small. The id 9 row is the exception: it has values from both datasets because the id matches
groundhogs_small %>% full_join(predictions_small, by = c())
## Joining with `by = join_by(id)`
## # A tibble: 19 × 6
## id name region is_groundhog year shadow
## <dbl> <chr> <chr> <lgl> <dbl> <lgl>
## 1 28 Stonewall Jackson New Jersey TRUE NA NA
## 2 22 Balzac Billy Alberta FALSE NA NA
## 3 9 Gertie the Groundhog Illinois TRUE 2009 FALSE
## 4 5 Concord Charlie West Virginia FALSE NA NA
## 5 38 Bowman Bill Virginia FALSE NA NA
## 6 16 Shubenacadie Sam Nova Scotia TRUE NA NA
## 7 4 Jimmy the Groundhog Wisconsin TRUE NA NA
## 8 14 Manitoba Merv Manitoba FALSE NA NA
## 9 56 Potomac Phil District of Columbia FALSE NA NA
## 10 62 Gordy the Groundhog Wisconsin TRUE NA NA
## 11 1 <NA> <NA> NA 1889 NA
## 12 43 <NA> <NA> NA 2015 TRUE
## 13 23 <NA> <NA> NA 2017 FALSE
## 14 41 <NA> <NA> NA 2020 FALSE
## 15 26 <NA> <NA> NA 2010 TRUE
## 16 7 <NA> <NA> NA 2011 FALSE
## 17 11 <NA> <NA> NA 2022 TRUE
## 18 21 <NA> <NA> NA 2006 TRUE
## 19 10 <NA> <NA> NA 1994 NA
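In the output above, an NA in shadow can mean two different things: the groundhog has no matching prediction row, or the prediction row exists but its shadow value is already NA (as for id 1 and id 10). One possible way to tell these apart is to add a marker column before joining. Here is a sketch on a few rows copied from the printed samples; `has_prediction` and the `*_demo` names are made up for this illustration.

```r
library(dplyr)

# A few rows copied from the printed samples above
groundhogs_demo <- tibble(
  id = c(28, 9),
  name = c("Stonewall Jackson", "Gertie the Groundhog")
)
predictions_demo <- tibble(
  id = c(1, 9),
  year = c(1889, 2009),
  shadow = c(NA, FALSE)
)

full_demo <- groundhogs_demo %>%
  full_join(
    # has_prediction is a hypothetical marker: TRUE for every prediction row
    predictions_demo %>% mutate(has_prediction = TRUE),
    by = "id"
  ) %>%
  # Rows with no match get NA for the marker; turn that into FALSE
  mutate(has_prediction = coalesce(has_prediction, FALSE))

full_demo
```

Here id 28 and id 1 both have NA in shadow, but only id 1 has `has_prediction` equal to TRUE, so its NA comes from the predictions data itself.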
Describe the resulting data:
How is it different from the original two datasets?
* Has only the columns from the groundhogs_small dataset
* The only row is id 9, as with inner_join. The difference is that inner_join also adds year and shadow, whereas semi_join returns only the columns of the first dataset, keeping the rows that have a match in the second dataset
groundhogs_small %>% semi_join(predictions_small, by = c())
## Joining with `by = join_by(id)`
## # A tibble: 1 × 4
## id name region is_groundhog
## <dbl> <chr> <chr> <lgl>
## 1 9 Gertie the Groundhog Illinois TRUE
Describe the resulting data:
How is it different from the original two datasets?
* Has only the columns from the groundhogs_small dataset
* Has 9 of the 10 rows of groundhogs_small. anti_join filters out the rows that have a match in the second dataset, which again is id 9, so it shows the rows that do not match between the two datasets
groundhogs_small %>% anti_join(predictions_small, by = c())
## Joining with `by = join_by(id)`
## # A tibble: 9 × 4
## id name region is_groundhog
## <dbl> <chr> <chr> <lgl>
## 1 28 Stonewall Jackson New Jersey TRUE
## 2 22 Balzac Billy Alberta FALSE
## 3 5 Concord Charlie West Virginia FALSE
## 4 38 Bowman Bill Virginia FALSE
## 5 16 Shubenacadie Sam Nova Scotia TRUE
## 6 4 Jimmy the Groundhog Wisconsin TRUE
## 7 14 Manitoba Merv Manitoba FALSE
## 8 56 Potomac Phil District of Columbia FALSE
## 9 62 Gordy the Groundhog Wisconsin TRUE
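semi_join and anti_join only filter the first table, so between them they should split its rows into two groups with nothing left over. Here is a sketch that checks this and compares each to an equivalent `filter()` on the key, using a few rows copied from the printed samples. The `*_demo`, `kept`, and `dropped` names are made up for this illustration.

```r
library(dplyr)

# A few rows copied from the printed samples above
groundhogs_demo <- tibble(
  id = c(28, 9, 5),
  name = c("Stonewall Jackson", "Gertie the Groundhog", "Concord Charlie")
)
predictions_demo <- tibble(
  id = c(1, 9, 43),
  year = c(1889, 2009, 2015)
)

kept    <- groundhogs_demo %>% semi_join(predictions_demo, by = "id")
dropped <- groundhogs_demo %>% anti_join(predictions_demo, by = "id")

# Equivalent filters on the key column
kept_filter    <- groundhogs_demo %>% filter(id %in% predictions_demo$id)
dropped_filter <- groundhogs_demo %>% filter(!id %in% predictions_demo$id)

# Every row of the first table lands in exactly one of the two results
nrow(kept) + nrow(dropped) == nrow(groundhogs_demo)
```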