1. Import your data

Import two related datasets from TidyTuesday Project.

# Dataset 1: Olympics athlete events
olympics <- read_excel("~/Desktop/PSU_DAT3000_IntroToDA/06_module9/myData.xlsx")

# Dataset 2: NOC country codes and region names
noc_regions <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv") %>%
  select(noc, team) %>%
  distinct() %>%
  rename(region = team)
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, team, noc, games, season, city, sport, event, medal
## dbl  (5): id, age, height, weight, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

olympics_small <- olympics %>%
  filter(season == "Summer", year >= 2000) %>%
  select(id, name, sex, noc, sport, medal) %>%
  distinct()


noc_small <- noc_regions %>%
  slice_head(n = 50)  

Describe the two datasets:

DData 1 — olympics_small

Data 2 — noc_small

3. inner_join

inner_joined <- inner_join(olympics_small, noc_small, by = "noc")
## Warning in inner_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
glimpse(inner_joined)
## Rows: 26,084
## Columns: 7
## $ id     <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name   <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex    <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc    <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport  <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal  <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…

Describe the resulting data:

Columns: 7 (6 from olympics_small + region from noc_small) Rows: fewer than olympics_small — only athletes whose noc matches one of the 50 codes in noc_small

How is it different from the original two datasets?

Athletes not in noc_small are dropped. Fewer rows than olympics_small, but adds a region column.

4. left_join

left_joined <- left_join(olympics_small, noc_small, by = "noc")
## Warning in left_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
glimpse(left_joined)
## Rows: 45,014
## Columns: 7
## $ id     <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name   <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex    <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc    <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport  <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal  <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…

Describe the resulting data:

Columns: 7 (same as inner join) * Rows: same as olympics_small — no athletes are lost

How is it different from the original two datasets?

All Olympic athletes stay in the data. If an athlete’s country code wasn’t in noc_small, the region part will just say NA. This type of join adds extra info without removing any original rows.

5. right_join

right_joined <- right_join(olympics_small, noc_small, by = "noc")
## Warning in right_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
glimpse(right_joined)
## Rows: 26,086
## Columns: 7
## $ id     <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name   <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex    <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc    <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport  <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal  <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…

Describe the resulting data:

How is it different from the original two datasets?

All 50 NOC codes will stay in the data. If a country in noc_small had no athletes in olympics_small, the athlete info will show as NA. This is the opposite of a left_join because it keeps the NOC list as the priority.

6. full_join

full_joined <- full_join(olympics_small, noc_small, by = "noc")
## Warning in full_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
glimpse(full_joined)
## Rows: 45,016
## Columns: 7
## $ id     <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name   <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex    <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc    <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport  <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal  <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…

Describe the resulting data:

How is it different from the original two datasets?

Nothing gets removed. Athletes with unmatched NOC codes will have NA for region, and NOC codes with no athletes will have NA for the athlete info. This is the most inclusive join because it keeps everything, even mismatches.

7. semi_join

semi_joined <- semi_join(olympics_small, noc_small, by = "noc")
glimpse(semi_joined)
## Rows: 24,688
## Columns: 6
## $ id    <dbl> 2, 12, 13, 18, 21, 22, 23, 30, 34, 36, 48, 49, 51, 53, 55, 62, 6…
## $ name  <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo Ant…
## $ sex   <chr> "M", "M", "F", "M", "F", "F", "M", "M", "M", "M", "M", "F", "M",…
## $ noc   <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "ROU", "NOR", "NED", "FRA", "…
## $ sport <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Weight…
## $ medal <chr> "NA", "NA", "NA", "NA", "Gold", "NA", "NA", "NA", "NA", "NA", "N…

Describe the resulting data:

Columns: 6 Rows: fewer than olympics_small

How is it different from the original two datasets?

No new columns are added. It just filters olympics_small to only athletes whose country is in noc_small. Useful for narrowing down the data without combining tables.

8. anti_join

anti_joined <- anti_join(olympics_small, noc_small, by = "noc")
glimpse(anti_joined)
## Rows: 18,930
## Columns: 6
## $ id    <dbl> 251, 258, 272, 273, 279, 310, 337, 338, 344, 345, 352, 353, 354,…
## $ name  <chr> "Bashir Abdi", "Hamza Abdo", "Berik Abdrakhmanov", "Maizurah Abd…
## $ sex   <chr> "M", "M", "M", "F", "M", "F", "M", "M", "M", "M", "M", "M", "M",…
## $ noc   <chr> "BEL", "PLE", "KAZ", "BRU", "BRU", "INA", "UZB", "UZB", "UAE", "…
## $ sport <chr> "Athletics", "Swimming", "Boxing", "Athletics", "Shooting", "Wei…
## $ medal <chr> "NA", "NA", "NA", "NA", "NA", "NA", "Gold", "NA", "NA", "NA", "N…

Describe the resulting data:

Columns: 6 Rows: athletes whose NOC code does not appear in noc_small

How is it different from the original two datasets?

This shows what’s missing from the right table. It’s useful for finding join problems, like athletes that would be removed in an inner_join because their country code isn’t in the lookup table.