Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from TidyTuesday Project.

expeditions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-09-22/expeditions.csv')

## Rows: 10364 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): expedition_id, peak_id, peak_name, season, termination_reason, tre...
## dbl  (6): year, highpoint_metres, members, member_deaths, hired_staff, hired...
## lgl  (1): oxygen_used
## date (3): basecamp_date, highpoint_date, termination_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-09-22/members.csv')

## Rows: 76519 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): expedition_id, member_id, peak_id, peak_name, season, sex, citizen...
## dbl  (5): year, age, highpoint_metres, death_height_metres, injury_height_me...
## lgl  (6): hired, success, solo, oxygen_used, died, injured
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data1

Columns:year, season, highpoint_metres, peak_name
Rows:20

Data 2

Columns:year, season, citizenship
Rows:20

set.seed(1234)
expeditions_small <- expeditions %>% select(year, season, highpoint_metres, peak_name) %>% sample_n(20)
members_small <- members %>% select(year, season, citizenship) %>% sample_n(20)

expeditions_small

## # A tibble: 20 × 4
##     year season highpoint_metres peak_name      
##    <dbl> <chr>             <dbl> <chr>          
##  1  2011 Autumn             8188 Cho Oyu        
##  2  2013 Spring             8850 Everest        
##  3  2011 Spring             6600 Putha Hiunchuli
##  4  2013 Spring             8188 Cho Oyu        
##  5  2010 Spring             8850 Everest        
##  6  2016 Spring             8850 Everest        
##  7  1991 Autumn             8200 Yalung Kang    
##  8  1986 Autumn             7250 Manaslu        
##  9  1909 Autumn             6965 Langpo         
## 10  1978 Spring             5000 Tukuche        
## 11  1962 Spring             8720 Everest        
## 12  2007 Spring             8850 Everest        
## 13  1972 Autumn             7268 Dhaulagiri VI  
## 14  1992 Spring             8200 Everest        
## 15  2003 Spring             6500 Langtang Ri    
## 16  1999 Autumn             8188 Cho Oyu        
## 17  1998 Spring             8850 Everest        
## 18  2001 Autumn             6966 Dorje Lhakpa   
## 19  1982 Spring             8400 Everest        
## 20  2006 Spring             8850 Everest

members_small

## # A tibble: 20 × 3
##     year season citizenship
##    <dbl> <chr>  <chr>      
##  1  1994 Spring Nepal      
##  2  2004 Autumn Nepal      
##  3  1991 Autumn Venezuela  
##  4  2007 Autumn France     
##  5  1980 Spring UK         
##  6  2017 Autumn Poland     
##  7  2019 Spring China      
##  8  1992 Autumn France     
##  9  2012 Autumn USA        
## 10  1988 Autumn Japan      
## 11  1996 Autumn Slovenia   
## 12  2018 Autumn Japan      
## 13  1995 Autumn Nepal      
## 14  1997 Spring Spain      
## 15  1993 Spring India      
## 16  2013 Autumn France     
## 17  2019 Spring Russia     
## 18  2019 Spring Nepal      
## 19  1991 Autumn India      
## 20  1979 Winter Poland

3. inner_join

Describe the resulting data:

Columns:year season citizenship highpoint_metres peak_name
Rows:2

How is it different from the original two datasets? 2 rows compared to 20 All columns present

members_small %>% inner_join(expeditions_small, by = c("year", "season"))

## # A tibble: 2 × 5
##    year season citizenship highpoint_metres peak_name  
##   <dbl> <chr>  <chr>                  <dbl> <chr>      
## 1  1991 Autumn Venezuela               8200 Yalung Kang
## 2  1991 Autumn India                   8200 Yalung Kang

4. left_join

Describe the resulting data:

Columns:year season citizenship highpoint_metres peak_name
Rows: 20

How is it different from the original two datasets? Same number of columns and rows, but NA values are introduced

members_small %>% left_join(expeditions_small, by = c("year", "season"))

## # A tibble: 20 × 5
##     year season citizenship highpoint_metres peak_name  
##    <dbl> <chr>  <chr>                  <dbl> <chr>      
##  1  1994 Spring Nepal                     NA <NA>       
##  2  2004 Autumn Nepal                     NA <NA>       
##  3  1991 Autumn Venezuela               8200 Yalung Kang
##  4  2007 Autumn France                    NA <NA>       
##  5  1980 Spring UK                        NA <NA>       
##  6  2017 Autumn Poland                    NA <NA>       
##  7  2019 Spring China                     NA <NA>       
##  8  1992 Autumn France                    NA <NA>       
##  9  2012 Autumn USA                       NA <NA>       
## 10  1988 Autumn Japan                     NA <NA>       
## 11  1996 Autumn Slovenia                  NA <NA>       
## 12  2018 Autumn Japan                     NA <NA>       
## 13  1995 Autumn Nepal                     NA <NA>       
## 14  1997 Spring Spain                     NA <NA>       
## 15  1993 Spring India                     NA <NA>       
## 16  2013 Autumn France                    NA <NA>       
## 17  2019 Spring Russia                    NA <NA>       
## 18  2019 Spring Nepal                     NA <NA>       
## 19  1991 Autumn India                   8200 Yalung Kang
## 20  1979 Winter Poland                    NA <NA>

5. right_join

Describe the resulting data:

Columns:year season citizenship highpoint_metres peak_name
Rows: 10

How is it different from the original two datasets? 21 rows present this time with na values in the citzenship column

members_small %>% right_join(expeditions_small, by = c("year", "season"))

## # A tibble: 21 × 5
##     year season citizenship highpoint_metres peak_name      
##    <dbl> <chr>  <chr>                  <dbl> <chr>          
##  1  1991 Autumn Venezuela               8200 Yalung Kang    
##  2  1991 Autumn India                   8200 Yalung Kang    
##  3  2011 Autumn <NA>                    8188 Cho Oyu        
##  4  2013 Spring <NA>                    8850 Everest        
##  5  2011 Spring <NA>                    6600 Putha Hiunchuli
##  6  2013 Spring <NA>                    8188 Cho Oyu        
##  7  2010 Spring <NA>                    8850 Everest        
##  8  2016 Spring <NA>                    8850 Everest        
##  9  1986 Autumn <NA>                    7250 Manaslu        
## 10  1909 Autumn <NA>                    6965 Langpo         
## # ℹ 11 more rows

6. full_join

Describe the resulting data:

Columns:year season citizenship highpoint_metres peak_name
Rows:39

How is it different from the original two datasets? Full join is a combination of left and right join

members_small %>% full_join(expeditions_small, by = c("year", "season"))

## # A tibble: 39 × 5
##     year season citizenship highpoint_metres peak_name  
##    <dbl> <chr>  <chr>                  <dbl> <chr>      
##  1  1994 Spring Nepal                     NA <NA>       
##  2  2004 Autumn Nepal                     NA <NA>       
##  3  1991 Autumn Venezuela               8200 Yalung Kang
##  4  2007 Autumn France                    NA <NA>       
##  5  1980 Spring UK                        NA <NA>       
##  6  2017 Autumn Poland                    NA <NA>       
##  7  2019 Spring China                     NA <NA>       
##  8  1992 Autumn France                    NA <NA>       
##  9  2012 Autumn USA                       NA <NA>       
## 10  1988 Autumn Japan                     NA <NA>       
## # ℹ 29 more rows

7. semi_join

Describe the resulting data:

Columns:year season citizenship
Rows:2

How is it different from the original two datasets? Only two rows. The same rows as inner join but without columns from expedition dataset

members_small %>% semi_join(expeditions_small, by = c("year", "season"))

## # A tibble: 2 × 3
##    year season citizenship
##   <dbl> <chr>  <chr>      
## 1  1991 Autumn Venezuela  
## 2  1991 Autumn India

8. anti_join

Describe the resulting data:

Columns: year season citizenship
Rows:18

How is it different from the original two datasets? Shows rows that are not related to expeditions_small

members_small %>% anti_join(expeditions_small, by = c("year", "season"))

## # A tibble: 18 × 3
##     year season citizenship
##    <dbl> <chr>  <chr>      
##  1  1994 Spring Nepal      
##  2  2004 Autumn Nepal      
##  3  2007 Autumn France     
##  4  1980 Spring UK         
##  5  2017 Autumn Poland     
##  6  2019 Spring China      
##  7  1992 Autumn France     
##  8  2012 Autumn USA        
##  9  1988 Autumn Japan      
## 10  1996 Autumn Slovenia   
## 11  2018 Autumn Japan      
## 12  1995 Autumn Nepal      
## 13  1997 Spring Spain      
## 14  1993 Spring India      
## 15  2013 Autumn France     
## 16  2019 Spring Russia     
## 17  2019 Spring Nepal      
## 18  1979 Winter Poland

Week 9: Apply it to your data 8

Angus Somers

2042-04-19

1. Import your data

2. Make data small

3. inner_join

4. left_join

5. right_join

6. full_join

7. semi_join

8. anti_join