Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

World_cup_matches <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/wcmatches.csv')

## Rows: 900 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (11): country, city, stage, home_team, away_team, outcome, win_conditio...
## dbl   (3): year, home_score, away_score
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

World_Cups <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv')

## Rows: 21 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): host, winner, second, third, fourth
## dbl (5): year, goals_scored, teams, games, attendance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: World Cup matches (one row per match)

Columns: year, winning_team, home_score, away_score
Rows: 15

Data 2: World Cups (one row per tournament)

Columns: year, winner, second (runner-up)
Rows: 15

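library(dplyr)  # provides %>%, select(), sample_n() and the join functions

set.seed(1234)  # make the random samples reproducible
World_cup_matches_small <- World_cup_matches %>% select(year, winning_team, home_score, away_score) %>% sample_n(15)
World_Cups_small <- World_Cups %>% select(year, winner, second) %>% sample_n(15)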

World_cup_matches_small

## # A tibble: 15 × 4
##     year winning_team home_score away_score
##    <dbl> <chr>             <dbl>      <dbl>
##  1  1978 <NA>                  0          0
##  2  2018 Sweden                1          0
##  3  1954 West Germany          3          2
##  4  2002 <NA>                  1          1
##  5  2006 Germany               4          2
##  6  1986 Brazil                4          0
##  7  1954 West Germany          1          6
##  8  1958 Brazil                0          3
##  9  2010 Argentina             4          1
## 10  2002 Spain                 3          1
## 11  1982 Soviet Union          3          0
## 12  1954 Yugoslavia            0          1
## 13  2018 Belgium               0          1
## 14  1974 West Germany          2          1
## 15  1986 Denmark               6          1

World_Cups_small

## # A tibble: 15 × 3
##     year winner       second        
##    <dbl> <chr>        <chr>         
##  1  1950 Uruguay      Brazil        
##  2  2018 France       Croatia       
##  3  1966 England      West Germany  
##  4  1938 Italy        Hungary       
##  5  2014 Germany      Argentina     
##  6  1994 Brazil       Italy         
##  7  1998 France       Brazil        
##  8  1986 Argentina    West Germany  
##  9  1974 West Germany Netherlands   
## 10  1954 West Germany Hungary       
## 11  1934 Italy        Czechoslovakia
## 12  2010 Spain        Netherlands   
## 13  2002 Brazil       Germany       
## 14  1990 West Germany Argentina     
## 15  1970 Brazil       Italy
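
As a side note on the sampling step: `sample_n()` still works, but it has been superseded in dplyr 1.0.0 and later by `slice_sample()`. The sketch below shows the same select-then-sample pattern with `slice_sample()`. The table `toy_matches` is a made-up stand-in for the real data, and with the same seed it is not guaranteed to pick the same rows as `sample_n()`.

```r
library(dplyr)

# Hypothetical toy table standing in for World_cup_matches.
toy_matches <- tibble(
  year       = c(1954, 1974, 1986, 2002, 2010, 2018),
  home_score = c(3, 2, 4, 1, 4, 1)
)

set.seed(1234)  # fix the seed so the draw is reproducible

# slice_sample() draws n rows at random, without replacement by default.
toy_small <- toy_matches %>%
  select(year, home_score) %>%
  slice_sample(n = 3)

toy_small
```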

3. inner_join

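Describe the resulting data: Each row of the matches sample is kept only if its year also appears in the World Cups sample, and the winner and second columns for that year are added to it.

Columns: 6
Rows: 11

How is it different from the original two datasets? It only contains the years that the two samples share. The 4 match rows with no matching World Cup and the 9 World Cups with no matching match are dropped.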

World_cup_matches_small %>% inner_join(World_Cups_small)

## Joining with `by = join_by(year)`

## # A tibble: 11 × 6
##     year winning_team home_score away_score winner       second      
##    <dbl> <chr>             <dbl>      <dbl> <chr>        <chr>       
##  1  2018 Sweden                1          0 France       Croatia     
##  2  1954 West Germany          3          2 West Germany Hungary     
##  3  2002 <NA>                  1          1 Brazil       Germany     
##  4  1986 Brazil                4          0 Argentina    West Germany
##  5  1954 West Germany          1          6 West Germany Hungary     
##  6  2010 Argentina             4          1 Spain        Netherlands 
##  7  2002 Spain                 3          1 Brazil       Germany     
##  8  1954 Yugoslavia            0          1 West Germany Hungary     
##  9  2018 Belgium               0          1 France       Croatia     
## 10  1974 West Germany          2          1 West Germany Netherlands 
## 11  1986 Denmark               6          1 Argentina    West Germany
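
The 11 rows of the inner join can be predicted from the year columns alone. The sketch below is a minimal check that uses only the `year` values printed for the two samples; the other columns are left out, and the names `match_years` and `cup_years` are made up for illustration. It also passes `by = "year"` explicitly, which avoids the `Joining with` message.

```r
library(dplyr)

# Year columns copied from the two printed samples.
match_years <- tibble(year = c(1978, 2018, 1954, 2002, 2006, 1986, 1954, 1958,
                               2010, 2002, 1982, 1954, 2018, 1974, 1986))
cup_years   <- tibble(year = c(1950, 2018, 1966, 1938, 2014, 1994, 1998, 1986,
                               1974, 1954, 1934, 2010, 2002, 1990, 1970))

# Years that occur in both samples.
shared_years <- base::intersect(match_years$year, cup_years$year)

# Every match row with a shared year survives exactly once,
# because each year appears only once in cup_years.
expected_rows <- sum(match_years$year %in% shared_years)

joined <- inner_join(match_years, cup_years, by = "year")

length(shared_years)  # number of shared years
expected_rows         # predicted row count
nrow(joined)          # actual row count
```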

4. left_join

Describe the resulting data:

Columns: 6
Rows: 15

How is it different from the original two datasets? It keeps all 15 rows of the matches sample and adds the winner and second columns. The 4 matches whose year is missing from the World Cups sample get NA in those two columns.

World_cup_matches_small %>% left_join(World_Cups_small)

## Joining with `by = join_by(year)`

## # A tibble: 15 × 6
##     year winning_team home_score away_score winner       second      
##    <dbl> <chr>             <dbl>      <dbl> <chr>        <chr>       
##  1  1978 <NA>                  0          0 <NA>         <NA>        
##  2  2018 Sweden                1          0 France       Croatia     
##  3  1954 West Germany          3          2 West Germany Hungary     
##  4  2002 <NA>                  1          1 Brazil       Germany     
##  5  2006 Germany               4          2 <NA>         <NA>        
##  6  1986 Brazil                4          0 Argentina    West Germany
##  7  1954 West Germany          1          6 West Germany Hungary     
##  8  1958 Brazil                0          3 <NA>         <NA>        
##  9  2010 Argentina             4          1 Spain        Netherlands 
## 10  2002 Spain                 3          1 Brazil       Germany     
## 11  1982 Soviet Union          3          0 <NA>         <NA>        
## 12  1954 Yugoslavia            0          1 West Germany Hungary     
## 13  2018 Belgium               0          1 France       Croatia     
## 14  1974 West Germany          2          1 West Germany Netherlands 
## 15  1986 Denmark               6          1 Argentina    West Germany
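
The NA values in the winner column mark the match rows that found no partner. A small sketch of how one might list them, assuming only the `year` values of the matches sample and the `year` and `winner` values of the World Cups sample as printed earlier (the names `matches` and `cups` are made up for illustration):

```r
library(dplyr)

matches <- tibble(year = c(1978, 2018, 1954, 2002, 2006, 1986, 1954, 1958,
                           2010, 2002, 1982, 1954, 2018, 1974, 1986))
cups <- tibble(
  year   = c(1950, 2018, 1966, 1938, 2014, 1994, 1998, 1986,
             1974, 1954, 1934, 2010, 2002, 1990, 1970),
  winner = c("Uruguay", "France", "England", "Italy", "Germany", "Brazil",
             "France", "Argentina", "West Germany", "West Germany", "Italy",
             "Spain", "Brazil", "West Germany", "Brazil")
)

# left_join() keeps every row of matches, in the original order.
left <- left_join(matches, cups, by = "year")

# Rows where no World Cup was found for the year.
unmatched <- filter(left, is.na(winner))
unmatched$year
```

This works here because the winner column has no missing values of its own, so an NA can only come from a failed match.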

5. right_join

Describe the resulting data:

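Columns: 6
Rows: 20

How is it different from the original two datasets? It is the mirror image of the left_join: it keeps every row of the World Cups sample instead of the matches sample. The 11 matching match rows come first, followed by the 9 World Cups with no match in the sample, which have NA in winning_team, home_score and away_score.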

World_cup_matches_small %>% right_join(World_Cups_small)

## Joining with `by = join_by(year)`

## # A tibble: 20 × 6
##     year winning_team home_score away_score winner       second        
##    <dbl> <chr>             <dbl>      <dbl> <chr>        <chr>         
##  1  2018 Sweden                1          0 France       Croatia       
##  2  1954 West Germany          3          2 West Germany Hungary       
##  3  2002 <NA>                  1          1 Brazil       Germany       
##  4  1986 Brazil                4          0 Argentina    West Germany  
##  5  1954 West Germany          1          6 West Germany Hungary       
##  6  2010 Argentina             4          1 Spain        Netherlands   
##  7  2002 Spain                 3          1 Brazil       Germany       
##  8  1954 Yugoslavia            0          1 West Germany Hungary       
##  9  2018 Belgium               0          1 France       Croatia       
## 10  1974 West Germany          2          1 West Germany Netherlands   
## 11  1986 Denmark               6          1 Argentina    West Germany  
## 12  1950 <NA>                 NA         NA Uruguay      Brazil        
## 13  1966 <NA>                 NA         NA England      West Germany  
## 14  1938 <NA>                 NA         NA Italy        Hungary       
## 15  2014 <NA>                 NA         NA Germany      Argentina     
## 16  1994 <NA>                 NA         NA Brazil       Italy         
## 17  1998 <NA>                 NA         NA France       Brazil        
## 18  1934 <NA>                 NA         NA Italy        Czechoslovakia
## 19  1990 <NA>                 NA         NA West Germany Argentina     
## 20  1970 <NA>                 NA         NA Brazil       Italy

6. full_join

Describe the resulting data:

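Columns: 6
Rows: 24

How is it different from the original two datasets? The full join keeps every row from both samples: the 15 match rows plus the 9 World Cups with no matching match. It has all six columns (year, winning_team, home_score, away_score, winner and second), and rows without a partner are filled with NA on the side that has no data.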

World_cup_matches_small %>% full_join(World_Cups_small)

## Joining with `by = join_by(year)`

## # A tibble: 24 × 6
##     year winning_team home_score away_score winner       second      
##    <dbl> <chr>             <dbl>      <dbl> <chr>        <chr>       
##  1  1978 <NA>                  0          0 <NA>         <NA>        
##  2  2018 Sweden                1          0 France       Croatia     
##  3  1954 West Germany          3          2 West Germany Hungary     
##  4  2002 <NA>                  1          1 Brazil       Germany     
##  5  2006 Germany               4          2 <NA>         <NA>        
##  6  1986 Brazil                4          0 Argentina    West Germany
##  7  1954 West Germany          1          6 West Germany Hungary     
##  8  1958 Brazil                0          3 <NA>         <NA>        
##  9  2010 Argentina             4          1 Spain        Netherlands 
## 10  2002 Spain                 3          1 Brazil       Germany     
## # ℹ 14 more rows
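
The printout stops after 10 rows, so the last 14 rows are hidden. The sketch below shows how one might print all rows and check the row count, using a hand-picked subset of rows from the printed samples (the names `matches` and `cups` are made up for illustration). In the samples above the same arithmetic gives 15 + 9 = 24.

```r
library(dplyr)

matches <- tibble(
  year         = c(2006, 2018, 1954, 1954),
  winning_team = c("Germany", "Sweden", "West Germany", "Yugoslavia")
)
cups <- tibble(
  year   = c(2018, 1954, 1950),
  winner = c("France", "West Germany", "Uruguay")
)

full <- full_join(matches, cups, by = "year")

# Rows of cups that found no partner in matches.
cups_only <- anti_join(cups, matches, by = "year")

# With one row per year in cups, the full join holds every row of matches
# plus the rows of cups that found no partner.
nrow(full) == nrow(matches) + nrow(cups_only)

# Print every row instead of only the first 10.
print(full, n = Inf)
```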

7. semi_join

Describe the resulting data:

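Columns: 4
Rows: 11

How is it different from the original two datasets? It keeps the 11 match rows whose year appears in the World Cups sample, but only the 4 columns of the matches sample (year, winning_team, home_score and away_score). No columns from the World Cups sample are added.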

World_cup_matches_small %>% semi_join(World_Cups_small)

## Joining with `by = join_by(year)`

## # A tibble: 11 × 4
##     year winning_team home_score away_score
##    <dbl> <chr>             <dbl>      <dbl>
##  1  2018 Sweden                1          0
##  2  1954 West Germany          3          2
##  3  2002 <NA>                  1          1
##  4  1986 Brazil                4          0
##  5  1954 West Germany          1          6
##  6  2010 Argentina             4          1
##  7  2002 Spain                 3          1
##  8  1954 Yugoslavia            0          1
##  9  2018 Belgium               0          1
## 10  1974 West Germany          2          1
## 11  1986 Denmark               6          1

8. anti_join

Describe the resulting data:

Columns: 4
Rows: 4

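How is it different from the original two datasets? It keeps only the 4 match rows whose year does not appear in the World Cups sample, with just the 4 columns of the matches sample (year, winning_team, home_score and away_score).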

World_cup_matches_small %>% anti_join(World_Cups_small)

## Joining with `by = join_by(year)`

## # A tibble: 4 × 4
##    year winning_team home_score away_score
##   <dbl> <chr>             <dbl>      <dbl>
## 1  1978 <NA>                  0          0
## 2  2006 Germany               4          2
## 3  1958 Brazil                0          3
## 4  1982 Soviet Union          3          0
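
The semi_join and anti_join results together account for every row of the matches sample: 11 + 4 = 15. A minimal sketch of that check, on a hand-picked subset of rows from the printed samples (the names `matches` and `cups` are made up for illustration):

```r
library(dplyr)

matches <- tibble(
  year         = c(2006, 2018, 1954, 1954),
  winning_team = c("Germany", "Sweden", "West Germany", "Yugoslavia")
)
cups <- tibble(
  year   = c(2018, 1954, 1950),
  winner = c("France", "West Germany", "Uruguay")
)

# Match rows with a World Cup in cups.
semi <- semi_join(matches, cups, by = "year")

# Match rows without a World Cup in cups.
anti <- anti_join(matches, cups, by = "year")

# Each row of matches lands in exactly one of the two results.
nrow(semi) + nrow(anti) == nrow(matches)

# Neither filtering join adds columns from cups.
names(semi)
```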

Sondre Asheim

2024-03-28
