Import two related datasets from TidyTuesday Project.
winners <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-25/winners.csv')
## Rows: 163 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Category, Athlete, Nationality
## dbl (1): Year
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
london_marathon <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-25/london_marathon.csv')
## Rows: 42 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Official charity
## dbl (6): Year, Applicants, Accepted, Starters, Finishers, Raised
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Describe the two datasets:
Data1: winners
Data 2: london_marathon_sm
set.seed(1234)
winners_sm <- winners %>% select(Year, Time, Nationality) %>% sample_n(10)
london_marathon_sm <- london_marathon %>% select(Year, Raised, Finishers) %>% sample_n(10)
winners_sm
## # A tibble: 10 × 3
## Year Time Nationality
## <dbl> <time> <chr>
## 1 2007 02:07:41 Kenya
## 2 2017 02:17:01 Kenya
## 3 2008 01:48:04 Switzerland
## 4 1998 01:35:18 Switzerland
## 5 2008 01:33:56 United Kingdom
## 6 1995 02:17:02 United Kingdom
## 7 1991 02:04:40 Denmark
## 8 2002 02:22:51 United Kingdom
## 9 1990 02:10:25 Denmark
## 10 1995 01:39:14 Switzerland
london_marathon_sm
## # A tibble: 10 × 3
## Year Raised Finishers
## <dbl> <dbl> <dbl>
## 1 2019 66.4 42549
## 2 2002 NA 32950
## 3 2006 NA 33250
## 4 1986 NA 18067
## 5 1995 NA 25377
## 6 1994 NA 25242
## 7 2017 61.5 39487
## 8 1984 NA 15675
## 9 2015 54.1 37793
## 10 2001 NA 30318
Describe the resulting data:
How is it different from the original two datasets? This dataset only has 4 rows compared to the 10 of both of the original datasets. This dataset has the year column that they both share as well as the rest of columns of both the data sets have.
winners_sm %>% inner_join(london_marathon_sm)
## Joining with `by = join_by(Year)`
## # A tibble: 4 × 5
## Year Time Nationality Raised Finishers
## <dbl> <time> <chr> <dbl> <dbl>
## 1 2017 02:17:01 Kenya 61.5 39487
## 2 1995 02:17:02 United Kingdom NA 25377
## 3 2002 02:22:51 United Kingdom NA 32950
## 4 1995 01:39:14 Switzerland NA 25377
Describe the resulting data:
How is it different from the original two datasets? This dataset has 10 rows just like both of the original datasets. This dataset has the year, time, and nationality columns with all of the data from winners_sm as well as the raised and finishers columns with data where applicable.
left_join(winners_sm,london_marathon_sm)
## Joining with `by = join_by(Year)`
## # A tibble: 10 × 5
## Year Time Nationality Raised Finishers
## <dbl> <time> <chr> <dbl> <dbl>
## 1 2007 02:07:41 Kenya NA NA
## 2 2017 02:17:01 Kenya 61.5 39487
## 3 2008 01:48:04 Switzerland NA NA
## 4 1998 01:35:18 Switzerland NA NA
## 5 2008 01:33:56 United Kingdom NA NA
## 6 1995 02:17:02 United Kingdom NA 25377
## 7 1991 02:04:40 Denmark NA NA
## 8 2002 02:22:51 United Kingdom NA 32950
## 9 1990 02:10:25 Denmark NA NA
## 10 1995 01:39:14 Switzerland NA 25377
Describe the resulting data:
How is it different from the original two datasets? This dataset has 11 rows one more than both of the original datasets. This dataset has the year and finishers columns with all of the data from london_marathon_sm as well as the raised, nationality, and time columns with data where applicable.
right_join(winners_sm,london_marathon_sm)
## Joining with `by = join_by(Year)`
## # A tibble: 11 × 5
## Year Time Nationality Raised Finishers
## <dbl> <time> <chr> <dbl> <dbl>
## 1 2017 02:17:01 Kenya 61.5 39487
## 2 1995 02:17:02 United Kingdom NA 25377
## 3 2002 02:22:51 United Kingdom NA 32950
## 4 1995 01:39:14 Switzerland NA 25377
## 5 2019 NA <NA> 66.4 42549
## 6 2006 NA <NA> NA 33250
## 7 1986 NA <NA> NA 18067
## 8 1994 NA <NA> NA 25242
## 9 1984 NA <NA> NA 15675
## 10 2015 NA <NA> 54.1 37793
## 11 2001 NA <NA> NA 30318
Describe the resulting data:
How is it different from the original two datasets? This dataset has 17 rows seven more than both of the original datasets. This dataset has all of the columns and data where available from both data sets combined without overlap.
full_join(winners_sm,london_marathon_sm)
## Joining with `by = join_by(Year)`
## # A tibble: 17 × 5
## Year Time Nationality Raised Finishers
## <dbl> <time> <chr> <dbl> <dbl>
## 1 2007 02:07:41 Kenya NA NA
## 2 2017 02:17:01 Kenya 61.5 39487
## 3 2008 01:48:04 Switzerland NA NA
## 4 1998 01:35:18 Switzerland NA NA
## 5 2008 01:33:56 United Kingdom NA NA
## 6 1995 02:17:02 United Kingdom NA 25377
## 7 1991 02:04:40 Denmark NA NA
## 8 2002 02:22:51 United Kingdom NA 32950
## 9 1990 02:10:25 Denmark NA NA
## 10 1995 01:39:14 Switzerland NA 25377
## 11 2019 NA <NA> 66.4 42549
## 12 2006 NA <NA> NA 33250
## 13 1986 NA <NA> NA 18067
## 14 1994 NA <NA> NA 25242
## 15 1984 NA <NA> NA 15675
## 16 2015 NA <NA> 54.1 37793
## 17 2001 NA <NA> NA 30318
Describe the resulting data:
How is it different from the original two datasets? This dataset only has 4 rows compared to the 10 of both of the original datasets. This dataset has kept all observations in winners_sm that have a match in london_marathon_sm. *It only has three columns which are all the original columns of winners_sm.
semi_join(winners_sm,london_marathon_sm)
## Joining with `by = join_by(Year)`
## # A tibble: 4 × 3
## Year Time Nationality
## <dbl> <time> <chr>
## 1 2017 02:17:01 Kenya
## 2 1995 02:17:02 United Kingdom
## 3 2002 02:22:51 United Kingdom
## 4 1995 01:39:14 Switzerland
Describe the resulting data:
How is it different from the original two datasets? This dataset only has 6 rows compared to the 10 of both of the original datasets. This dataset has dropped all observations in winners_sm that have a match in london_marathon_sm. *It only has three columns which are all the original columns of winners_sm.
anti_join(winners_sm,london_marathon_sm)
## Joining with `by = join_by(Year)`
## # A tibble: 6 × 3
## Year Time Nationality
## <dbl> <time> <chr>
## 1 2007 02:07:41 Kenya
## 2 2008 01:48:04 Switzerland
## 3 1998 01:35:18 Switzerland
## 4 2008 01:33:56 United Kingdom
## 5 1991 02:04:40 Denmark
## 6 1990 02:10:25 Denmark