Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from TidyTuesday Project.

winners <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-25/winners.csv')

## Rows: 163 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Category, Athlete, Nationality
## dbl  (1): Year
## time (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

london_marathon <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-04-25/london_marathon.csv')

## Rows: 42 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Official charity
## dbl  (6): Year, Applicants, Accepted, Starters, Finishers, Raised
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data1: winners

Columns: Year, Time, Nationality
Rows: 10 rows

Data 2: london_marathon_sm

Columns: Year, Raised, Finishers
Rows: 10 rows

set.seed(1234)
winners_sm  <- winners %>% select(Year, Time, Nationality) %>% sample_n(10)
london_marathon_sm <- london_marathon %>% select(Year, Raised, Finishers) %>% sample_n(10)
winners_sm

## # A tibble: 10 × 3
##     Year Time     Nationality   
##    <dbl> <time>   <chr>         
##  1  2007 02:07:41 Kenya         
##  2  2017 02:17:01 Kenya         
##  3  2008 01:48:04 Switzerland   
##  4  1998 01:35:18 Switzerland   
##  5  2008 01:33:56 United Kingdom
##  6  1995 02:17:02 United Kingdom
##  7  1991 02:04:40 Denmark       
##  8  2002 02:22:51 United Kingdom
##  9  1990 02:10:25 Denmark       
## 10  1995 01:39:14 Switzerland

london_marathon_sm

## # A tibble: 10 × 3
##     Year Raised Finishers
##    <dbl>  <dbl>     <dbl>
##  1  2019   66.4     42549
##  2  2002   NA       32950
##  3  2006   NA       33250
##  4  1986   NA       18067
##  5  1995   NA       25377
##  6  1994   NA       25242
##  7  2017   61.5     39487
##  8  1984   NA       15675
##  9  2015   54.1     37793
## 10  2001   NA       30318

3. inner_join

Describe the resulting data:

Columns: Year, Time, Nationality, Raised, Finishers
Rows: 4 rows

How is it different from the original two datasets? This dataset only has 4 rows compared to the 10 of both of the original datasets. This dataset has the year column that they both share as well as the rest of columns of both the data sets have.

winners_sm %>% inner_join(london_marathon_sm)

## Joining with `by = join_by(Year)`

## # A tibble: 4 × 5
##    Year Time     Nationality    Raised Finishers
##   <dbl> <time>   <chr>           <dbl>     <dbl>
## 1  2017 02:17:01 Kenya            61.5     39487
## 2  1995 02:17:02 United Kingdom   NA       25377
## 3  2002 02:22:51 United Kingdom   NA       32950
## 4  1995 01:39:14 Switzerland      NA       25377

4. left_join

Describe the resulting data:

Columns: Year, Time, Nationality, Raised, Finishers
Rows: 10 rows

How is it different from the original two datasets? This dataset has 10 rows just like both of the original datasets. This dataset has the year, time, and nationality columns with all of the data from winners_sm as well as the raised and finishers columns with data where applicable.

left_join(winners_sm,london_marathon_sm)

## Joining with `by = join_by(Year)`

## # A tibble: 10 × 5
##     Year Time     Nationality    Raised Finishers
##    <dbl> <time>   <chr>           <dbl>     <dbl>
##  1  2007 02:07:41 Kenya            NA          NA
##  2  2017 02:17:01 Kenya            61.5     39487
##  3  2008 01:48:04 Switzerland      NA          NA
##  4  1998 01:35:18 Switzerland      NA          NA
##  5  2008 01:33:56 United Kingdom   NA          NA
##  6  1995 02:17:02 United Kingdom   NA       25377
##  7  1991 02:04:40 Denmark          NA          NA
##  8  2002 02:22:51 United Kingdom   NA       32950
##  9  1990 02:10:25 Denmark          NA          NA
## 10  1995 01:39:14 Switzerland      NA       25377

5. right_join

Describe the resulting data:

Columns: Year, Time, Nationality, Raised, Finishers
Rows: 11 rows

How is it different from the original two datasets? This dataset has 11 rows one more than both of the original datasets. This dataset has the year and finishers columns with all of the data from london_marathon_sm as well as the raised, nationality, and time columns with data where applicable.

right_join(winners_sm,london_marathon_sm)

## Joining with `by = join_by(Year)`

## # A tibble: 11 × 5
##     Year Time     Nationality    Raised Finishers
##    <dbl> <time>   <chr>           <dbl>     <dbl>
##  1  2017 02:17:01 Kenya            61.5     39487
##  2  1995 02:17:02 United Kingdom   NA       25377
##  3  2002 02:22:51 United Kingdom   NA       32950
##  4  1995 01:39:14 Switzerland      NA       25377
##  5  2019       NA <NA>             66.4     42549
##  6  2006       NA <NA>             NA       33250
##  7  1986       NA <NA>             NA       18067
##  8  1994       NA <NA>             NA       25242
##  9  1984       NA <NA>             NA       15675
## 10  2015       NA <NA>             54.1     37793
## 11  2001       NA <NA>             NA       30318

6. full_join

Describe the resulting data:

Columns: Year, Time, Nationality, Raised, Finishers
Rows: 17 rows

How is it different from the original two datasets? This dataset has 17 rows seven more than both of the original datasets. This dataset has all of the columns and data where available from both data sets combined without overlap.

full_join(winners_sm,london_marathon_sm)

## Joining with `by = join_by(Year)`

## # A tibble: 17 × 5
##     Year Time     Nationality    Raised Finishers
##    <dbl> <time>   <chr>           <dbl>     <dbl>
##  1  2007 02:07:41 Kenya            NA          NA
##  2  2017 02:17:01 Kenya            61.5     39487
##  3  2008 01:48:04 Switzerland      NA          NA
##  4  1998 01:35:18 Switzerland      NA          NA
##  5  2008 01:33:56 United Kingdom   NA          NA
##  6  1995 02:17:02 United Kingdom   NA       25377
##  7  1991 02:04:40 Denmark          NA          NA
##  8  2002 02:22:51 United Kingdom   NA       32950
##  9  1990 02:10:25 Denmark          NA          NA
## 10  1995 01:39:14 Switzerland      NA       25377
## 11  2019       NA <NA>             66.4     42549
## 12  2006       NA <NA>             NA       33250
## 13  1986       NA <NA>             NA       18067
## 14  1994       NA <NA>             NA       25242
## 15  1984       NA <NA>             NA       15675
## 16  2015       NA <NA>             54.1     37793
## 17  2001       NA <NA>             NA       30318

7. semi_join

Describe the resulting data:

Columns: Year, Time, Nationality
Rows: 4 rows

How is it different from the original two datasets? This dataset only has 4 rows compared to the 10 of both of the original datasets. This dataset has kept all observations in winners_sm that have a match in london_marathon_sm. *It only has three columns which are all the original columns of winners_sm.

semi_join(winners_sm,london_marathon_sm)

## Joining with `by = join_by(Year)`

## # A tibble: 4 × 3
##    Year Time     Nationality   
##   <dbl> <time>   <chr>         
## 1  2017 02:17:01 Kenya         
## 2  1995 02:17:02 United Kingdom
## 3  2002 02:22:51 United Kingdom
## 4  1995 01:39:14 Switzerland

8. anti_join

Describe the resulting data:

Columns: Year, Time, Nationality
Rows: 6 rows

How is it different from the original two datasets? This dataset only has 6 rows compared to the 10 of both of the original datasets. This dataset has dropped all observations in winners_sm that have a match in london_marathon_sm. *It only has three columns which are all the original columns of winners_sm.

anti_join(winners_sm,london_marathon_sm)

## Joining with `by = join_by(Year)`

## # A tibble: 6 × 3
##    Year Time     Nationality   
##   <dbl> <time>   <chr>         
## 1  2007 02:07:41 Kenya         
## 2  2008 01:48:04 Switzerland   
## 3  1998 01:35:18 Switzerland   
## 4  2008 01:33:56 United Kingdom
## 5  1991 02:04:40 Denmark       
## 6  1990 02:10:25 Denmark

Week 9: Apply it to your data 8

Brady Martin

2024-06-13

1. Import your data

2. Make data small

3. inner_join

4. left_join

5. right_join

6. full_join

7. semi_join

8. anti_join