Import your data

library(tidyverse)
library(readxl)
bakers <- read_excel("../01_module4/data/myData.xlsx")
challenges <- read_csv("../00_data/challenges.csv")
## Rows: 1136 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): baker, result, signature, showstopper
## dbl (3): series, episode, technical
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Make Data Small

set.seed(2002)
bakers_small <- bakers %>% select(series, baker, series_winner) %>% sample_n(10)
challenges_small <- challenges %>% select(series, baker, result) %>% sample_n(10)

bakers_small
## # A tibble: 10 × 3
##    series baker  series_winner
##     <dbl> <chr>          <dbl>
##  1      4 Ali                0
##  2      8 Liam               0
##  3      9 Briony             0
##  4      9 Jon                0
##  5      9 Antony             0
##  6      5 Chetna             0
##  7      7 Andrew             0
##  8      8 James              0
##  9     10 David              1
## 10      7 Selasi             0
challenges_small
## # A tibble: 10 × 3
##    series baker  result    
##     <dbl> <chr>  <chr>     
##  1     10 Rosie  IN        
##  2      6 Marie  STAR BAKER
##  3      5 Jordan <NA>      
##  4      6 Mat    OUT       
##  5      8 Julia  <NA>      
##  6      5 Chetna IN        
##  7      5 Jordan <NA>      
##  8      6 Alvin  <NA>      
##  9      4 Toby   <NA>      
## 10      3 John   IN

Chapter 13

Inner_join

bakers_small %>%
    inner_join(challenges_small, by = c("baker", "series"))
## # A tibble: 1 × 4
##   series baker  series_winner result
##    <dbl> <chr>          <dbl> <chr> 
## 1      5 Chetna             0 IN

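Columns: series, baker, series_winner, result Rows: 1

How is it different from the original data set? The result has one row instead of 10, because only Chetna (series 5) appears in both tables with the same baker and series. It has four columns, with series_winner coming from bakers_small and result from challenges_small, and none of them contains a missing value.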
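The matching rule can be checked on a cut-down copy of the data. The sketch below rebuilds three rows from each printed table (the names bakers_mini and challenges_mini are made up for this example) and confirms that inner_join keeps only the baker and series pair found in both.

```r
library(dplyr)
library(tibble)

# Three rows copied from the printed bakers_small and challenges_small output.
# bakers_mini and challenges_mini are hypothetical names used only here.
bakers_mini <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  10,      "David",  1,
  4,       "Ali",    0
)
challenges_mini <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  10,      "Rosie",  "IN",
  4,       "Toby",   NA_character_
)

# Keep only rows whose (baker, series) pair exists in both tables.
joined <- inner_join(bakers_mini, challenges_mini, by = c("baker", "series"))
joined
```

David and Rosie share series 10 but not a name, so they do not match; both key columns must agree.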

Left_join

bakers_small %>%
    left_join(challenges_small, by = c("baker", "series"))
## # A tibble: 10 × 4
##    series baker  series_winner result
##     <dbl> <chr>          <dbl> <chr> 
##  1      4 Ali                0 <NA>  
##  2      8 Liam               0 <NA>  
##  3      9 Briony             0 <NA>  
##  4      9 Jon                0 <NA>  
##  5      9 Antony             0 <NA>  
##  6      5 Chetna             0 IN    
##  7      7 Andrew             0 <NA>  
##  8      8 James              0 <NA>  
##  9     10 David              1 <NA>  
## 10      7 Selasi             0 <NA>

Columns: series, baker, series_winner, result Rows: 10

How is it different from the original data? All 10 rows of bakers_small are kept, and a result column is added from challenges_small. Only one contestant, Chetna, appears in both data sets, so her row is the only one with a value in result; the other nine rows show NA there.
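A quick way to see how many rows actually found a match is to count the non-missing values in the added column. This is a minimal sketch on a few rows copied from the printed tables; bakers_mini and challenges_mini are hypothetical names.

```r
library(dplyr)
library(tibble)

# Rows copied from the printed output; the *_mini names are made up.
bakers_mini <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  10,      "David",  1,
  4,       "Ali",    0
)
challenges_mini <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  10,      "Rosie",  "IN"
)

left <- left_join(bakers_mini, challenges_mini, by = c("baker", "series"))

# Every row of the left table survives the join.
rows_kept <- nrow(left) == nrow(bakers_mini)

# Rows without a match carry NA in the column that came from the right table.
n_matched <- sum(!is.na(left$result))
```

Counting non-missing values works here only because Chetna's result is not NA. A matched row whose result is missing in challenges_small would look the same as an unmatched one.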

Right_join

challenges_small %>%
    right_join(bakers_small, by = c("baker", "series"))
## # A tibble: 10 × 4
##    series baker  result series_winner
##     <dbl> <chr>  <chr>          <dbl>
##  1      5 Chetna IN                 0
##  2      4 Ali    <NA>               0
##  3      8 Liam   <NA>               0
##  4      9 Briony <NA>               0
##  5      9 Jon    <NA>               0
##  6      9 Antony <NA>               0
##  7      7 Andrew <NA>               0
##  8      8 James  <NA>               0
##  9     10 David  <NA>               1
## 10      7 Selasi <NA>               0

Columns: series, baker, result, series_winner Rows: 10

How is it different from the original data? The result has ten rows, one for each row of bakers_small rather than for each row of challenges_small. Nine of them have NA in result because their baker and series have no match in challenges_small. The first row, Chetna, is filled in across every column because she is the only contestant who appears in both data sets.
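Because the tables were swapped as well as the join direction, this right_join should hold the same information as a left_join with bakers_small first, differing only in row and column order. The sketch below checks that on a few rows copied from the printed tables; bakers_mini, challenges_mini and the helper tidy() are hypothetical names.

```r
library(dplyr)
library(tibble)

# Rows copied from the printed output; the *_mini names are made up.
bakers_mini <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  10,      "David",  1,
  4,       "Ali",    0
)
challenges_mini <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  10,      "Rosie",  "IN"
)

left  <- left_join(bakers_mini, challenges_mini, by = c("baker", "series"))
right <- right_join(challenges_mini, bakers_mini, by = c("baker", "series"))

# Put both results in the same column and row order before comparing.
tidy <- function(df) {
  df %>%
    select(series, baker, series_winner, result) %>%
    arrange(series, baker) %>%
    as.data.frame()
}

same <- isTRUE(all.equal(tidy(left), tidy(right)))
```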

Semi_join

challenges_small %>%
    semi_join(bakers_small, by = c("baker", "series"))
## # A tibble: 1 × 3
##   series baker  result
##    <dbl> <chr>  <chr> 
## 1      5 Chetna IN

Columns: series, baker, result Rows: 1

How is it different from the original data? Only one of the ten rows of challenges_small is kept: the row whose baker and series also appear in bakers_small. There is no series_winner column, because semi_join filters the first table and does not add columns from the second.
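The difference from inner_join shows up in the column names. A minimal sketch, using a few rows copied from the printed tables (bakers_mini and challenges_mini are hypothetical names):

```r
library(dplyr)
library(tibble)

# Rows copied from the printed output; the *_mini names are made up.
bakers_mini <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  10,      "David",  1
)
challenges_mini <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  10,      "Rosie",  "IN",
  6,       "Mat",    "OUT"
)

semi  <- semi_join(challenges_mini, bakers_mini, by = c("baker", "series"))
inner <- inner_join(challenges_mini, bakers_mini, by = c("baker", "series"))

names(semi)   # only the columns of challenges_mini
names(inner)  # the same columns plus series_winner
```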

Anti_join

challenges_small %>%
    anti_join(bakers_small, by = c("baker", "series"))
## # A tibble: 9 × 3
##   series baker  result    
##    <dbl> <chr>  <chr>     
## 1     10 Rosie  IN        
## 2      6 Marie  STAR BAKER
## 3      5 Jordan <NA>      
## 4      6 Mat    OUT       
## 5      8 Julia  <NA>      
## 6      5 Jordan <NA>      
## 7      6 Alvin  <NA>      
## 8      4 Toby   <NA>      
## 9      3 John   IN

Columns: series, baker, result Rows: 9

How is this different from the original data sets? Nine of the ten rows of challenges_small are kept; anti_join drops the one row (Chetna, series 5) that has a match in bakers_small. There is no series_winner column, because anti_join, like semi_join, does not add columns from the second table.
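With the same tables and keys, semi_join and anti_join split the first table into two parts: the rows with a match and the rows without one. The sketch below checks that the parts add up, using a few rows copied from the printed tables; bakers_mini and challenges_mini are hypothetical names.

```r
library(dplyr)
library(tibble)

# Rows copied from the printed output; the *_mini names are made up.
bakers_mini <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  10,      "David",  1
)
challenges_mini <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  5,       "Jordan", NA_character_,
  5,       "Jordan", NA_character_,
  6,       "Mat",    "OUT"
)

kept    <- semi_join(challenges_mini, bakers_mini, by = c("baker", "series"))
dropped <- anti_join(challenges_mini, bakers_mini, by = c("baker", "series"))

# Every row of challenges_mini lands in exactly one of the two results.
adds_up <- nrow(kept) + nrow(dropped) == nrow(challenges_mini)
```

The two Jordan rows are both returned by anti_join; filtering joins keep or drop rows as they are and do not remove duplicates.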