Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

# dplyr provides %>%, select(), sample_n() and the join functions used below
library(dplyr)

attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-04/attendance.csv')

## Rows: 10846 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): team, team_name
## dbl (6): year, total, home, away, week, weekly_attendance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-02-04/standings.csv')

## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: attendance_small

Columns: team_name, year, week
Rows: 10

Data 2: standings_small

Columns: team, year, wins
Rows: 10

set.seed(1234)
attendance_small <- attendance %>% select(team_name, year, week) %>% sample_n(10)
standings_small <- standings %>% select(team, year, wins) %>% sample_n(10)

attendance_small

## # A tibble: 10 × 3
##    team_name   year  week
##    <chr>      <dbl> <dbl>
##  1 Steelers    2013     6
##  2 Chargers    2014     9
##  3 Browns      2013     5
##  4 Buccaneers  2014    11
##  5 Colts       2013    10
##  6 Titans      2016    16
##  7 Bears       2001    11
##  8 Steelers    2001    16
##  9 Chiefs      2005     7
## 10 Cardinals   2004     4

standings_small

## # A tibble: 10 × 3
##    team           year  wins
##    <chr>         <dbl> <dbl>
##  1 Indianapolis   2003    12
##  2 Tampa Bay      2018     5
##  3 Cincinnati     2010     4
##  4 Philadelphia   2002    12
##  5 Kansas City    2008     2
##  6 St. Louis      2011     2
##  7 Carolina       2005    11
##  8 San Francisco  2017     6
##  9 Buffalo        2000     8
## 10 Tennessee      2017     9
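
`sample_n()` still works, but recent dplyr releases (1.0.0 and later) mark it as superseded in favour of `slice_sample()`. Below is a minimal sketch of the same sampling step, assuming dplyr 1.0.0 or newer. The `attendance_demo` tibble is a stand-in built from five rows printed above so the block runs without a download, and the rows it draws are not guaranteed to match the `sample_n()` result.

```r
library(dplyr)

# Stand-in for `attendance`: five rows copied from the printed output above.
attendance_demo <- tibble::tribble(
  ~team_name,  ~year, ~week,
  "Steelers",   2013,     6,
  "Chargers",   2014,     9,
  "Browns",     2013,     5,
  "Chiefs",     2005,     7,
  "Cardinals",  2004,     4
)

set.seed(1234)

# slice_sample(n = ) draws n rows at random, without replacement by default.
demo_small <- attendance_demo %>%
  select(team_name, year, week) %>%
  slice_sample(n = 3)

nrow(demo_small)
```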

3. inner_join

Describe the resulting data:

Columns: team_name, year, week, team, wins
Rows: 1

How is it different from the original two datasets?

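It has only 1 row, compared to 10 in each original dataset, because 2005 is the only year that appears in both.
It contains every column from both datasets, with the shared key column year appearing once. The join matches on year alone, so the row pairs the Chiefs' week 7 record with Carolina's 2005 standings record.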

attendance_small %>% inner_join(standings_small)

## Joining with `by = join_by(year)`

## # A tibble: 1 × 5
##   team_name  year  week team      wins
##   <chr>     <dbl> <dbl> <chr>    <dbl>
## 1 Chiefs     2005     7 Carolina    11
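
The `Joining with` message shows that dplyr chose `year` as the key because it is the only column name the two tibbles share. The sketch below passes `by = "year"` explicitly, which should give the same result without the message. The two `_demo` tibbles are small subsets copied from the printed output above, not the full samples.

```r
library(dplyr)

# Subsets of the two small tibbles, copied from the printed output above.
attendance_demo <- tibble::tribble(
  ~team_name,  ~year, ~week,
  "Chiefs",     2005,     7,
  "Cardinals",  2004,     4
)
standings_demo <- tibble::tribble(
  ~team,           ~year, ~wins,
  "Carolina",       2005,    11,
  "Indianapolis",   2003,    12
)

# Naming the key explicitly avoids relying on dplyr's guess.
matched <- inner_join(attendance_demo, standings_demo, by = "year")
matched

# The key is year only, so the matched row combines two different teams.
names(matched)
```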

4. left_join

Describe the resulting data:

Columns: team_name, year, week, team, wins
Rows: 10

How is it different from the original two datasets?

It keeps all 10 rows of attendance_small and adds the team and wins columns from standings_small.
Rows whose year has no match in standings_small are kept with NA in team and wins; only the 2005 row is filled in.

left_join(attendance_small, standings_small)

## Joining with `by = join_by(year)`

## # A tibble: 10 × 5
##    team_name   year  week team      wins
##    <chr>      <dbl> <dbl> <chr>    <dbl>
##  1 Steelers    2013     6 <NA>        NA
##  2 Chargers    2014     9 <NA>        NA
##  3 Browns      2013     5 <NA>        NA
##  4 Buccaneers  2014    11 <NA>        NA
##  5 Colts       2013    10 <NA>        NA
##  6 Titans      2016    16 <NA>        NA
##  7 Bears       2001    11 <NA>        NA
##  8 Steelers    2001    16 <NA>        NA
##  9 Chiefs      2005     7 Carolina    11
## 10 Cardinals   2004     4 <NA>        NA
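
One way to check how many rows found no match is to count the NA values in a column that comes from the right-hand table. This is a minimal sketch on a three-row and two-row subset of the printed data (the `_demo` names are stand-ins), and it assumes `wins` has no missing values of its own.

```r
library(dplyr)

# Subsets copied from the printed output above.
attendance_demo <- tibble::tribble(
  ~team_name,  ~year, ~week,
  "Steelers",   2013,     6,
  "Chiefs",     2005,     7,
  "Cardinals",  2004,     4
)
standings_demo <- tibble::tribble(
  ~team,           ~year, ~wins,
  "Carolina",       2005,    11,
  "Indianapolis",   2003,    12
)

joined <- left_join(attendance_demo, standings_demo, by = "year")

# Every row of the left table survives the join.
nrow(joined) == nrow(attendance_demo)

# Unmatched rows carry NA in the columns taken from standings_demo.
sum(is.na(joined$wins))
```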

5. right_join

Describe the resulting data:

Columns: team_name, year, week, team, wins
Rows: 10

How is it different from the original two datasets?

It keeps all 10 rows of standings_small and adds the team_name and week columns from attendance_small.
Rows whose year has no match in attendance_small have NA in team_name and week. The columns still appear in the order of the first table, attendance_small.

right_join(attendance_small, standings_small)

## Joining with `by = join_by(year)`

## # A tibble: 10 × 5
##    team_name  year  week team           wins
##    <chr>     <dbl> <dbl> <chr>         <dbl>
##  1 Chiefs     2005     7 Carolina         11
##  2 <NA>       2003    NA Indianapolis     12
##  3 <NA>       2018    NA Tampa Bay         5
##  4 <NA>       2010    NA Cincinnati        4
##  5 <NA>       2002    NA Philadelphia     12
##  6 <NA>       2008    NA Kansas City       2
##  7 <NA>       2011    NA St. Louis         2
##  8 <NA>       2017    NA San Francisco     6
##  9 <NA>       2000    NA Buffalo           8
## 10 <NA>       2017    NA Tennessee         9
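
A right join holds the same information as a left join with the two tables swapped; only the row and column order differ. The sketch below checks this on small subsets of the printed data (the `_demo` names are stand-ins) by putting both results into the same column and row order before comparing them.

```r
library(dplyr)

# Subsets copied from the printed output above.
attendance_demo <- tibble::tribble(
  ~team_name,  ~year, ~week,
  "Chiefs",     2005,     7,
  "Cardinals",  2004,     4
)
standings_demo <- tibble::tribble(
  ~team,           ~year, ~wins,
  "Carolina",       2005,    11,
  "Indianapolis",   2003,    12
)

cols <- c("team", "year", "wins", "team_name", "week")

via_right <- right_join(attendance_demo, standings_demo, by = "year") %>%
  select(all_of(cols)) %>%
  arrange(team)

via_left <- left_join(standings_demo, attendance_demo, by = "year") %>%
  select(all_of(cols)) %>%
  arrange(team)

# TRUE when both joins hold the same rows and values.
same_result <- isTRUE(all.equal(as.data.frame(via_right), as.data.frame(via_left)))
same_result
```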

6. full_join

Describe the resulting data:

Columns: team, year, wins, team_name, week
Rows: 19

How is it different from the original two datasets?

The full join keeps every row from both datasets: 1 matched row plus 9 unmatched rows from each side, 19 in total.
The only match is 2005, where Carolina's 11-win season is paired with the Chiefs' week 7 record. Because the join matches on year alone, that row does not describe a single team, and it says nothing about which team had the best record or the best attendance.

full_join(standings_small, attendance_small, by = c("year"))

## # A tibble: 19 × 5
##    team           year  wins team_name   week
##    <chr>         <dbl> <dbl> <chr>      <dbl>
##  1 Indianapolis   2003    12 <NA>          NA
##  2 Tampa Bay      2018     5 <NA>          NA
##  3 Cincinnati     2010     4 <NA>          NA
##  4 Philadelphia   2002    12 <NA>          NA
##  5 Kansas City    2008     2 <NA>          NA
##  6 St. Louis      2011     2 <NA>          NA
##  7 Carolina       2005    11 Chiefs         7
##  8 San Francisco  2017     6 <NA>          NA
##  9 Buffalo        2000     8 <NA>          NA
## 10 Tennessee      2017     9 <NA>          NA
## 11 <NA>           2013    NA Steelers       6
## 12 <NA>           2014    NA Chargers       9
## 13 <NA>           2013    NA Browns         5
## 14 <NA>           2014    NA Buccaneers    11
## 15 <NA>           2013    NA Colts         10
## 16 <NA>           2016    NA Titans        16
## 17 <NA>           2001    NA Bears         11
## 18 <NA>           2001    NA Steelers      16
## 19 <NA>           2004    NA Cardinals      4
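
The 19 rows can be accounted for by counting the matched rows and the unmatched rows on each side. The sketch below rebuilds both small tibbles from the printed output so it runs without the download, then compares the three counts with the size of the full join.

```r
library(dplyr)

# Both small tibbles, retyped from the printed output above.
attendance_small <- tibble::tribble(
  ~team_name,   ~year, ~week,
  "Steelers",    2013,     6,
  "Chargers",    2014,     9,
  "Browns",      2013,     5,
  "Buccaneers",  2014,    11,
  "Colts",       2013,    10,
  "Titans",      2016,    16,
  "Bears",       2001,    11,
  "Steelers",    2001,    16,
  "Chiefs",      2005,     7,
  "Cardinals",   2004,     4
)
standings_small <- tibble::tribble(
  ~team,            ~year, ~wins,
  "Indianapolis",    2003,    12,
  "Tampa Bay",       2018,     5,
  "Cincinnati",      2010,     4,
  "Philadelphia",    2002,    12,
  "Kansas City",     2008,     2,
  "St. Louis",       2011,     2,
  "Carolina",        2005,    11,
  "San Francisco",   2017,     6,
  "Buffalo",         2000,     8,
  "Tennessee",       2017,     9
)

n_matched    <- nrow(inner_join(standings_small, attendance_small, by = "year"))
n_only_stand <- nrow(anti_join(standings_small, attendance_small, by = "year"))
n_only_att   <- nrow(anti_join(attendance_small, standings_small, by = "year"))
n_full       <- nrow(full_join(standings_small, attendance_small, by = "year"))

# Matched rows plus the unmatched rows from each side give the full join size.
c(matched = n_matched, only_standings = n_only_stand,
  only_attendance = n_only_att, full = n_full)
```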

7. semi_join

Describe the resulting data:

Columns: team, year, wins
Rows: 1

How is it different from the original two datasets?

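It keeps only the rows of standings_small whose year also appears in attendance_small, which is the single 2005 row for Carolina.
It has only the columns of standings_small; nothing from attendance_small is added. With 1 row and 3 columns, it is the smallest result so far.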

semi_join(standings_small, attendance_small)

## Joining with `by = join_by(year)`

## # A tibble: 1 × 3
##   team      year  wins
##   <chr>    <dbl> <dbl>
## 1 Carolina  2005    11
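
A semi join works like a filter: it keeps rows of the first table whose key appears in the second table and adds no columns. The sketch below compares it with an equivalent `filter()` call on small subsets of the printed data (the `_demo` names are stand-ins). The `%in%` version is only an equivalent for a single key column such as `year`.

```r
library(dplyr)

# Subsets copied from the printed output above.
standings_demo <- tibble::tribble(
  ~team,           ~year, ~wins,
  "Carolina",       2005,    11,
  "Indianapolis",   2003,    12,
  "Buffalo",        2000,     8
)
attendance_demo <- tibble::tribble(
  ~team_name,  ~year, ~week,
  "Chiefs",     2005,     7,
  "Cardinals",  2004,     4
)

via_semi   <- semi_join(standings_demo, attendance_demo, by = "year")
via_filter <- standings_demo %>% filter(year %in% attendance_demo$year)

# Both keep the same rows, and only the columns of standings_demo.
names(via_semi)
semi_matches_filter <- isTRUE(all.equal(as.data.frame(via_semi),
                                        as.data.frame(via_filter)))
semi_matches_filter
```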

8. anti_join

Describe the resulting data:

Columns: team_name, year, week
Rows: 9

How is it different from the original two datasets?

It keeps only the rows of attendance_small whose year does not appear in standings_small, so the 2005 Chiefs row is dropped and 9 rows remain.
Like semi_join, it returns only the columns of the first dataset. Each row is one sampled team-week record from a year that is missing from the standings sample.

anti_join(attendance_small, standings_small)

## Joining with `by = join_by(year)`

## # A tibble: 9 × 3
##   team_name   year  week
##   <chr>      <dbl> <dbl>
## 1 Steelers    2013     6
## 2 Chargers    2014     9
## 3 Browns      2013     5
## 4 Buccaneers  2014    11
## 5 Colts       2013    10
## 6 Titans      2016    16
## 7 Bears       2001    11
## 8 Steelers    2001    16
## 9 Cardinals   2004     4
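
Taken together, `semi_join()` and `anti_join()` split the first table into rows with a match and rows without one, so their row counts should add up to the size of that table. This is a minimal sketch on small subsets of the printed data; the `_demo` names are stand-ins.

```r
library(dplyr)

# Subsets copied from the printed output above.
attendance_demo <- tibble::tribble(
  ~team_name,  ~year, ~week,
  "Steelers",   2013,     6,
  "Titans",     2016,    16,
  "Chiefs",     2005,     7,
  "Cardinals",  2004,     4
)
standings_demo <- tibble::tribble(
  ~team,           ~year, ~wins,
  "Carolina",       2005,    11,
  "Indianapolis",   2003,    12
)

with_match    <- semi_join(attendance_demo, standings_demo, by = "year")
without_match <- anti_join(attendance_demo, standings_demo, by = "year")

# Every row of attendance_demo lands in exactly one of the two results.
nrow(with_match) + nrow(without_match) == nrow(attendance_demo)
without_match
```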

Ethan Schena

2025-04-01
