Import two related datasets from TidyTuesday Project.
expeditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-09-22/expeditions.csv')
## Rows: 10364 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): expedition_id, peak_id, peak_name, season, termination_reason, tre...
## dbl (6): year, highpoint_metres, members, member_deaths, hired_staff, hired...
## lgl (1): oxygen_used
## date (3): basecamp_date, highpoint_date, termination_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-09-22/members.csv')
## Rows: 76519 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): expedition_id, member_id, peak_id, peak_name, season, sex, citizen...
## dbl (5): year, age, highpoint_metres, death_height_metres, injury_height_me...
## lgl (6): hired, success, solo, oxygen_used, died, injured
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Describe the two datasets:
Data1
Data 2
set.seed(1234)
expeditions_small <- expeditions %>% select(year, season, highpoint_metres, peak_name) %>% sample_n(20)
members_small <- members %>% select(year, season, citizenship) %>% sample_n(20)
expeditions_small
## # A tibble: 20 × 4
## year season highpoint_metres peak_name
## <dbl> <chr> <dbl> <chr>
## 1 2011 Autumn 8188 Cho Oyu
## 2 2013 Spring 8850 Everest
## 3 2011 Spring 6600 Putha Hiunchuli
## 4 2013 Spring 8188 Cho Oyu
## 5 2010 Spring 8850 Everest
## 6 2016 Spring 8850 Everest
## 7 1991 Autumn 8200 Yalung Kang
## 8 1986 Autumn 7250 Manaslu
## 9 1909 Autumn 6965 Langpo
## 10 1978 Spring 5000 Tukuche
## 11 1962 Spring 8720 Everest
## 12 2007 Spring 8850 Everest
## 13 1972 Autumn 7268 Dhaulagiri VI
## 14 1992 Spring 8200 Everest
## 15 2003 Spring 6500 Langtang Ri
## 16 1999 Autumn 8188 Cho Oyu
## 17 1998 Spring 8850 Everest
## 18 2001 Autumn 6966 Dorje Lhakpa
## 19 1982 Spring 8400 Everest
## 20 2006 Spring 8850 Everest
members_small
## # A tibble: 20 × 3
## year season citizenship
## <dbl> <chr> <chr>
## 1 1994 Spring Nepal
## 2 2004 Autumn Nepal
## 3 1991 Autumn Venezuela
## 4 2007 Autumn France
## 5 1980 Spring UK
## 6 2017 Autumn Poland
## 7 2019 Spring China
## 8 1992 Autumn France
## 9 2012 Autumn USA
## 10 1988 Autumn Japan
## 11 1996 Autumn Slovenia
## 12 2018 Autumn Japan
## 13 1995 Autumn Nepal
## 14 1997 Spring Spain
## 15 1993 Spring India
## 16 2013 Autumn France
## 17 2019 Spring Russia
## 18 2019 Spring Nepal
## 19 1991 Autumn India
## 20 1979 Winter Poland
Describe the resulting data:
How is it different from the original two datasets? 2 rows compared to 20 All columns present
members_small %>% inner_join(expeditions_small, by = c("year", "season"))
## # A tibble: 2 × 5
## year season citizenship highpoint_metres peak_name
## <dbl> <chr> <chr> <dbl> <chr>
## 1 1991 Autumn Venezuela 8200 Yalung Kang
## 2 1991 Autumn India 8200 Yalung Kang
Describe the resulting data:
How is it different from the original two datasets? Same number of columns and rows, but NA values are introduced
members_small %>% left_join(expeditions_small, by = c("year", "season"))
## # A tibble: 20 × 5
## year season citizenship highpoint_metres peak_name
## <dbl> <chr> <chr> <dbl> <chr>
## 1 1994 Spring Nepal NA <NA>
## 2 2004 Autumn Nepal NA <NA>
## 3 1991 Autumn Venezuela 8200 Yalung Kang
## 4 2007 Autumn France NA <NA>
## 5 1980 Spring UK NA <NA>
## 6 2017 Autumn Poland NA <NA>
## 7 2019 Spring China NA <NA>
## 8 1992 Autumn France NA <NA>
## 9 2012 Autumn USA NA <NA>
## 10 1988 Autumn Japan NA <NA>
## 11 1996 Autumn Slovenia NA <NA>
## 12 2018 Autumn Japan NA <NA>
## 13 1995 Autumn Nepal NA <NA>
## 14 1997 Spring Spain NA <NA>
## 15 1993 Spring India NA <NA>
## 16 2013 Autumn France NA <NA>
## 17 2019 Spring Russia NA <NA>
## 18 2019 Spring Nepal NA <NA>
## 19 1991 Autumn India 8200 Yalung Kang
## 20 1979 Winter Poland NA <NA>
Describe the resulting data:
How is it different from the original two datasets? 21 rows present this time with na values in the citzenship column
members_small %>% right_join(expeditions_small, by = c("year", "season"))
## # A tibble: 21 × 5
## year season citizenship highpoint_metres peak_name
## <dbl> <chr> <chr> <dbl> <chr>
## 1 1991 Autumn Venezuela 8200 Yalung Kang
## 2 1991 Autumn India 8200 Yalung Kang
## 3 2011 Autumn <NA> 8188 Cho Oyu
## 4 2013 Spring <NA> 8850 Everest
## 5 2011 Spring <NA> 6600 Putha Hiunchuli
## 6 2013 Spring <NA> 8188 Cho Oyu
## 7 2010 Spring <NA> 8850 Everest
## 8 2016 Spring <NA> 8850 Everest
## 9 1986 Autumn <NA> 7250 Manaslu
## 10 1909 Autumn <NA> 6965 Langpo
## # ℹ 11 more rows
Describe the resulting data:
How is it different from the original two datasets? Full join is a combination of left and right join
members_small %>% full_join(expeditions_small, by = c("year", "season"))
## # A tibble: 39 × 5
## year season citizenship highpoint_metres peak_name
## <dbl> <chr> <chr> <dbl> <chr>
## 1 1994 Spring Nepal NA <NA>
## 2 2004 Autumn Nepal NA <NA>
## 3 1991 Autumn Venezuela 8200 Yalung Kang
## 4 2007 Autumn France NA <NA>
## 5 1980 Spring UK NA <NA>
## 6 2017 Autumn Poland NA <NA>
## 7 2019 Spring China NA <NA>
## 8 1992 Autumn France NA <NA>
## 9 2012 Autumn USA NA <NA>
## 10 1988 Autumn Japan NA <NA>
## # ℹ 29 more rows
Describe the resulting data:
How is it different from the original two datasets? Only two rows. The same rows as inner join but without columns from expedition dataset
members_small %>% semi_join(expeditions_small, by = c("year", "season"))
## # A tibble: 2 × 3
## year season citizenship
## <dbl> <chr> <chr>
## 1 1991 Autumn Venezuela
## 2 1991 Autumn India
Describe the resulting data:
How is it different from the original two datasets? Shows rows that are not related to expeditions_small
members_small %>% anti_join(expeditions_small, by = c("year", "season"))
## # A tibble: 18 × 3
## year season citizenship
## <dbl> <chr> <chr>
## 1 1994 Spring Nepal
## 2 2004 Autumn Nepal
## 3 2007 Autumn France
## 4 1980 Spring UK
## 5 2017 Autumn Poland
## 6 2019 Spring China
## 7 1992 Autumn France
## 8 2012 Autumn USA
## 9 1988 Autumn Japan
## 10 1996 Autumn Slovenia
## 11 2018 Autumn Japan
## 12 1995 Autumn Nepal
## 13 1997 Spring Spain
## 14 1993 Spring India
## 15 2013 Autumn France
## 16 2019 Spring Russia
## 17 2019 Spring Nepal
## 18 1979 Winter Poland