Import two related datasets from TidyTuesday Project.
# Dataset 1: Olympics athlete events
olympics <- read_excel("~/Desktop/PSU_DAT3000_IntroToDA/06_module9/myData.xlsx")
# Dataset 2: NOC country codes and region names
noc_regions <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv") %>%
select(noc, team) %>%
distinct() %>%
rename(region = team)
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, team, noc, games, season, city, sport, event, medal
## dbl (5): id, age, height, weight, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
olympics_small <- olympics %>%
filter(season == "Summer", year >= 2000) %>%
select(id, name, sex, noc, sport, medal) %>%
distinct()
noc_small <- noc_regions %>%
slice_head(n = 50)
Describe the two datasets:
DData 1 — olympics_small
Data 2 — noc_small
inner_joined <- inner_join(olympics_small, noc_small, by = "noc")
## Warning in inner_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
glimpse(inner_joined)
## Rows: 26,084
## Columns: 7
## $ id <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…
Describe the resulting data:
Columns: 7 (6 from olympics_small + region from noc_small) Rows: fewer than olympics_small — only athletes whose noc matches one of the 50 codes in noc_small
How is it different from the original two datasets?
Athletes not in noc_small are dropped. Fewer rows than olympics_small, but adds a region column.
left_joined <- left_join(olympics_small, noc_small, by = "noc")
## Warning in left_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
glimpse(left_joined)
## Rows: 45,014
## Columns: 7
## $ id <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…
Describe the resulting data:
Columns: 7 (same as inner join) * Rows: same as
olympics_small — no athletes are lost
How is it different from the original two datasets?
All Olympic athletes stay in the data. If an athlete’s country code wasn’t in noc_small, the region part will just say NA. This type of join adds extra info without removing any original rows.
right_joined <- right_join(olympics_small, noc_small, by = "noc")
## Warning in right_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
glimpse(right_joined)
## Rows: 26,086
## Columns: 7
## $ id <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…
Describe the resulting data:
How is it different from the original two datasets?
All 50 NOC codes will stay in the data. If a country in noc_small had no athletes in olympics_small, the athlete info will show as NA. This is the opposite of a left_join because it keeps the NOC list as the priority.
full_joined <- full_join(olympics_small, noc_small, by = "noc")
## Warning in full_join(olympics_small, noc_small, by = "noc"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 5 of `x` matches multiple rows in `y`.
## ℹ Row 6 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
glimpse(full_joined)
## Rows: 45,016
## Columns: 7
## $ id <dbl> 2, 12, 13, 18, 21, 21, 22, 23, 23, 30, 34, 36, 48, 49, 51, 53, …
## $ name <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo An…
## $ sex <chr> "M", "M", "F", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M"…
## $ noc <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "NOR", "ROU", "NOR", "NOR", …
## $ sport <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Handb…
## $ medal <chr> "NA", "NA", "NA", "NA", "Gold", "Gold", "NA", "NA", "NA", "NA",…
## $ region <chr> "China", "Finland", "Finland", "Finland", "Norway", "Taifun", "…
Describe the resulting data:
How is it different from the original two datasets?
Nothing gets removed. Athletes with unmatched NOC codes will have NA for region, and NOC codes with no athletes will have NA for the athlete info. This is the most inclusive join because it keeps everything, even mismatches.
semi_joined <- semi_join(olympics_small, noc_small, by = "noc")
glimpse(semi_joined)
## Rows: 24,688
## Columns: 6
## $ id <dbl> 2, 12, 13, 18, 21, 22, 23, 30, 34, 36, 48, 49, 51, 53, 55, 62, 6…
## $ name <chr> "A Lamusi", "Jyri Tapani Aalto", "Minna Maarit Aalto", "Timo Ant…
## $ sex <chr> "M", "M", "F", "M", "F", "F", "M", "M", "M", "M", "M", "F", "M",…
## $ noc <chr> "CHN", "FIN", "FIN", "FIN", "NOR", "ROU", "NOR", "NED", "FRA", "…
## $ sport <chr> "Judo", "Badminton", "Sailing", "Athletics", "Handball", "Weight…
## $ medal <chr> "NA", "NA", "NA", "NA", "Gold", "NA", "NA", "NA", "NA", "NA", "N…
Describe the resulting data:
Columns: 6 Rows: fewer than olympics_small
How is it different from the original two datasets?
No new columns are added. It just filters olympics_small to only athletes whose country is in noc_small. Useful for narrowing down the data without combining tables.
anti_joined <- anti_join(olympics_small, noc_small, by = "noc")
glimpse(anti_joined)
## Rows: 18,930
## Columns: 6
## $ id <dbl> 251, 258, 272, 273, 279, 310, 337, 338, 344, 345, 352, 353, 354,…
## $ name <chr> "Bashir Abdi", "Hamza Abdo", "Berik Abdrakhmanov", "Maizurah Abd…
## $ sex <chr> "M", "M", "M", "F", "M", "F", "M", "M", "M", "M", "M", "M", "M",…
## $ noc <chr> "BEL", "PLE", "KAZ", "BRU", "BRU", "INA", "UZB", "UZB", "UAE", "…
## $ sport <chr> "Athletics", "Swimming", "Boxing", "Athletics", "Shooting", "Wei…
## $ medal <chr> "NA", "NA", "NA", "NA", "NA", "NA", "Gold", "NA", "NA", "NA", "N…
Describe the resulting data:
Columns: 6 Rows: athletes whose NOC code does not appear in
noc_small
How is it different from the original two datasets?
This shows what’s missing from the right table. It’s useful for finding join problems, like athletes that would be removed in an inner_join because their country code isn’t in the lookup table.