library(tidyverse)  # read_csv(), the dplyr verbs, and %>%
library(readxl)     # read_excel()
bakers <- read_excel("../01_module4/data/myData.xlsx")
challenges <- read_csv("../00_data/challenges.csv")
## Rows: 1136 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): baker, result, signature, showstopper
## dbl (3): series, episode, technical
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(2002)
bakers_small <- bakers %>% select(series, baker, series_winner) %>% sample_n(10)
challenges_small <- challenges %>% select(series, baker, result) %>% sample_n(10)
bakers_small
## # A tibble: 10 × 3
## series baker series_winner
## <dbl> <chr> <dbl>
## 1 4 Ali 0
## 2 8 Liam 0
## 3 9 Briony 0
## 4 9 Jon 0
## 5 9 Antony 0
## 6 5 Chetna 0
## 7 7 Andrew 0
## 8 8 James 0
## 9 10 David 1
## 10 7 Selasi 0
challenges_small
## # A tibble: 10 × 3
## series baker result
## <dbl> <chr> <chr>
## 1 10 Rosie IN
## 2 6 Marie STAR BAKER
## 3 5 Jordan <NA>
## 4 6 Mat OUT
## 5 8 Julia <NA>
## 6 5 Chetna IN
## 7 5 Jordan <NA>
## 8 6 Alvin <NA>
## 9 4 Toby <NA>
## 10 3 John IN
bakers_small %>%
inner_join(challenges_small, by = c("baker", "series"))
## # A tibble: 1 × 4
## series baker series_winner result
## <dbl> <chr> <dbl> <chr>
## 1 5 Chetna 0 IN
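Columns: series, baker, series_winner, result. Rows: 1
How is it different from the original data sets? Only one row remains instead of ten, because Chetna (series 5) is the only baker and series combination that appears in both samples. Every column in that row is filled in.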
bakers_small %>%
left_join(challenges_small, by = c("baker", "series"))
## # A tibble: 10 × 4
## series baker series_winner result
## <dbl> <chr> <dbl> <chr>
## 1 4 Ali 0 <NA>
## 2 8 Liam 0 <NA>
## 3 9 Briony 0 <NA>
## 4 9 Jon 0 <NA>
## 5 9 Antony 0 <NA>
## 6 5 Chetna 0 IN
## 7 7 Andrew 0 <NA>
## 8 8 James 0 <NA>
## 9 10 David 1 <NA>
## 10 7 Selasi 0 <NA>
Columns: series, baker, series_winner, result. Rows: 10
How is it different from the original data? All ten rows of bakers_small are kept. Only one contestant (Chetna) appears in both data sets, so her row is the only one with a result; the other nine rows have NA in the result column.
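One caveat when reading NA as "no match": challenges_small itself contains NA results (Jordan and Julia, for example), so after a left join an NA in result could also come from a matched row. The sketch below is one possible way to tell the two cases apart. It uses small hand-typed toy tables rather than the original files; the Jon row in challenges_toy and the helper column in_challenges are assumptions made up for the illustration.

```r
library(dplyr)
library(tibble)

# Toy tables typed by hand; most rows are copied from the printed samples.
bakers_toy <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  9,       "Jon",    0,
  10,      "David",  1
)
challenges_toy <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  9,       "Jon",    NA_character_,  # hypothetical row: a match whose result is NA
  6,       "Mat",    "OUT"
)

# Flag every row of the second table before joining, then fill the
# flag with FALSE for rows of bakers_toy that found no match.
flagged <- bakers_toy %>%
  left_join(
    challenges_toy %>% mutate(in_challenges = TRUE),  # made-up helper column
    by = c("baker", "series")
  ) %>%
  mutate(in_challenges = coalesce(in_challenges, FALSE))

flagged
```

In this toy output Jon and David both have NA in result, but only David has in_challenges equal to FALSE, so only David is truly unmatched.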
challenges_small %>%
right_join(bakers_small, by = c("baker", "series"))
## # A tibble: 10 × 4
## series baker result series_winner
## <dbl> <chr> <chr> <dbl>
## 1 5 Chetna IN 0
## 2 4 Ali <NA> 0
## 3 8 Liam <NA> 0
## 4 9 Briony <NA> 0
## 5 9 Jon <NA> 0
## 6 9 Antony <NA> 0
## 7 7 Andrew <NA> 0
## 8 8 James <NA> 0
## 9 10 David <NA> 1
## 10 7 Selasi <NA> 0
Columns: series, baker, result, series_winner. Rows: 10
How is it different from the original data? There are ten rows, one for each row of bakers_small. The result column is NA wherever the series and baker have no match in challenges_small. The first row (Chetna) is filled in across all columns, which shows that only one baker appears in both data sets.
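The right join here returns the same information as the earlier left join with the two tables swapped; only the row and column order differ. A minimal sketch of that check, using hand-typed toy tables (a subset of the rows printed above, not the original files):

```r
library(dplyr)
library(tibble)

# Toy tables typed by hand from rows in the printed samples.
bakers_toy <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  9,       "Jon",    0,
  10,      "David",  1
)
challenges_toy <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  5,       "Jordan", NA_character_,
  6,       "Mat",    "OUT"
)

right <- challenges_toy %>%
  right_join(bakers_toy, by = c("baker", "series"))
left <- bakers_toy %>%
  left_join(challenges_toy, by = c("baker", "series"))

# Put the columns and rows in the same order before comparing.
right_aligned <- right %>%
  select(all_of(names(left))) %>%
  arrange(series, baker)
left_aligned <- left %>%
  arrange(series, baker)

same_content <- isTRUE(all.equal(as.data.frame(right_aligned),
                                 as.data.frame(left_aligned)))
same_content
```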
challenges_small %>%
semi_join(bakers_small, by = c("baker", "series"))
## # A tibble: 1 × 3
## series baker result
## <dbl> <chr> <chr>
## 1 5 Chetna IN
Columns: series, baker, result. Rows: 1
How is it different from the original data? Only one of the ten rows of challenges_small is kept: the one with a match in bakers_small. There is no series_winner column, because semi_join only filters the rows of the first table and never adds columns from the second.
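With a single match, semi_join and inner_join return the same number of rows, so the difference between them is easy to miss. It shows up when a key matches more than once, as Jordan (series 5) does in challenges_small. The sketch below uses hand-typed rows from that sample; the lookup table `wanted` is a hypothetical helper introduced only for the illustration.

```r
library(dplyr)
library(tibble)

# Rows typed by hand from the printed challenges_small sample.
challenges_toy <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  5,       "Jordan", NA_character_,
  5,       "Jordan", NA_character_,
  6,       "Mat",    "OUT"
)

# Hypothetical lookup table: the bakers we want to keep.
wanted <- tribble(
  ~series, ~baker,
  5,       "Chetna",
  5,       "Jordan"
)

# inner_join returns one row per matching pair, so Jordan appears twice.
inner <- wanted %>% inner_join(challenges_toy, by = c("baker", "series"))

# semi_join keeps each row of `wanted` at most once and adds no columns.
semi <- wanted %>% semi_join(challenges_toy, by = c("baker", "series"))

nrow(inner)
nrow(semi)
names(semi)
```

Here inner has three rows and a result column, while semi has two rows and only the columns of `wanted`.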
challenges_small %>%
anti_join(bakers_small, by = c("baker", "series"))
## # A tibble: 9 × 3
## series baker result
## <dbl> <chr> <chr>
## 1 10 Rosie IN
## 2 6 Marie STAR BAKER
## 3 5 Jordan <NA>
## 4 6 Mat OUT
## 5 8 Julia <NA>
## 6 5 Jordan <NA>
## 7 6 Alvin <NA>
## 8 4 Toby <NA>
## 9 3 John IN
Columns: series, baker, result. Rows: 9
How is this different from the original data sets? Nine of the ten rows of challenges_small are kept; the one row with a match in bakers_small (Chetna) is dropped. No series_winner column is added, because anti_join, like semi_join, does not bring in columns from the second table.
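The semi join and anti join results above split challenges_small into two non-overlapping pieces: 1 matched row plus 9 unmatched rows gives back the 10 original rows. A minimal sketch of the same check on hand-typed toy tables (a subset of the printed rows, not the original files):

```r
library(dplyr)
library(tibble)

# Toy tables typed by hand from rows in the printed samples.
bakers_toy <- tribble(
  ~series, ~baker,   ~series_winner,
  5,       "Chetna", 0,
  9,       "Jon",    0,
  10,      "David",  1
)
challenges_toy <- tribble(
  ~series, ~baker,   ~result,
  5,       "Chetna", "IN",
  5,       "Jordan", NA_character_,
  6,       "Mat",    "OUT"
)

matched <- challenges_toy %>%
  semi_join(bakers_toy, by = c("baker", "series"))
unmatched <- challenges_toy %>%
  anti_join(bakers_toy, by = c("baker", "series"))

# Every row of challenges_toy lands in exactly one of the two results.
nrow(matched) + nrow(unmatched) == nrow(challenges_toy)
```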