Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

team_results <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-26/team-results.csv')

## Rows: 236 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): TEAM, F4PERCENT, CHAMPPERCENT
## dbl (17): TEAMID, PAKE, PAKERANK, PASE, PASERANK, GAMES, W, L, WINPERCENT, R...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

public_picks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-26/public-picks.csv')

## Rows: 64 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): TEAM, R64, R32, S16, E8, F4, FINALS
## dbl (2): YEAR, TEAMNO
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: public_picks

Columns: Team, Year
Rows: 30 (random sample of the original 64)

Data 2: team_results

Columns: Team, TeamID
Rows: 30 (random sample of the original 236)

library(dplyr)  # provides %>%, select(), sample_n(), and the join functions

team_results %>% select(TEAM, TEAMID) %>% sample_n(30)

## # A tibble: 30 × 2
##    TEAM           TEAMID
##    <chr>           <dbl>
##  1 Miami FL          113
##  2 UNC Greensboro    215
##  3 Saint Peter's     176
##  4 Holy Cross         75
##  5 Wisconsin         240
##  6 Seton Hall        181
##  7 BYU                25
##  8 Saint Joseph's    173
##  9 UTSA              223
## 10 Norfolk St.       134
## # ℹ 20 more rows

team_results_small <- team_results %>% select(TEAM, TEAMID) %>% sample_n(30)

public_picks %>% select(TEAM, YEAR) %>% sample_n(30)

## # A tibble: 30 × 2
##    TEAM          YEAR
##    <chr>        <dbl>
##  1 Stetson       2024
##  2 Clemson       2024
##  3 Alabama       2024
##  4 Iowa St.      2024
##  5 Saint Mary's  2024
##  6 TCU           2024
##  7 Kentucky      2024
##  8 Colgate       2024
##  9 Kansas        2024
## 10 Vermont       2024
## # ℹ 20 more rows

public_picks_small <- public_picks %>% select(TEAM, YEAR) %>% sample_n(30)

team_results_small

## # A tibble: 30 × 2
##    TEAM         TEAMID
##    <chr>         <dbl>
##  1 Lafayette        92
##  2 Morehead St.    125
##  3 Hampton          71
##  4 Dayton           45
##  5 Davidson         44
##  6 Princeton       165
##  7 Norfolk St.     134
##  8 Xavier          244
##  9 Wofford         241
## 10 Kent St.         89
## # ℹ 20 more rows

public_picks_small

## # A tibble: 30 × 2
##    TEAM              YEAR
##    <chr>            <dbl>
##  1 Saint Mary's      2024
##  2 McNeese St.       2024
##  3 Wisconsin         2024
##  4 South Dakota St.  2024
##  5 Kansas            2024
##  6 UAB               2024
##  7 Oakland           2024
##  8 Florida Atlantic  2024
##  9 Iowa St.          2024
## 10 Longwood          2024
## # ℹ 20 more rows
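
Because sample_n() draws a new random sample on every call, the rows printed by the preview calls above are not the same rows stored in team_results_small and public_picks_small, and the join results will change each time the document is knit. A minimal sketch of one way to make the sample reproducible follows. It assumes a small made-up tibble (toy_teams, hypothetical) in place of the downloaded data, and the seed value is arbitrary.

```r
library(dplyr)

# Hypothetical stand-in for team_results; names and IDs are made up.
toy_teams <- tibble(
  TEAM   = c("Team A", "Team B", "Team C", "Team D", "Team E", "Team F"),
  TEAMID = c(1, 2, 3, 4, 5, 6)
)

# Fix the random seed (arbitrary value), then sample once and store the result.
set.seed(1234)
toy_small <- toy_teams %>% select(TEAM, TEAMID) %>% sample_n(3)

# Print the stored object instead of sampling a second time.
toy_small

# Resetting the same seed reproduces the same sample.
set.seed(1234)
toy_again <- toy_teams %>% select(TEAM, TEAMID) %>% sample_n(3)
identical(toy_small, toy_again)
```

Sampling once, assigning the result, and then printing the stored object keeps the preview and the data used in the joins in agreement.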

3. inner_join

Describe the resulting data:

Columns: Team, TeamID, Year
Rows: 4

How is it different from the original two datasets?

4 rows compared to 30 rows in each original dataset; only teams that appear in both samples are kept
All columns from the two datasets

team_results_small %>% inner_join(public_picks_small)

## Joining with `by = join_by(TEAM)`

## # A tibble: 4 × 3
##   TEAM                  TEAMID  YEAR
##   <chr>                  <dbl> <dbl>
## 1 Morehead St.             125  2024
## 2 Dayton                    45  2024
## 3 Creighton                 43  2024
## 4 College of Charleston     37  2024
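
The message "Joining with `by = join_by(TEAM)`" appears because no key was given, so dplyr joins on every column name the two tables share. A short sketch of naming the key explicitly, using two made-up tibbles (picks and results, both hypothetical) rather than the sampled data:

```r
library(dplyr)

# Hypothetical toy tables; teams "B" and "C" appear in both.
picks   <- tibble(TEAM = c("A", "B", "C"), YEAR = c(2024, 2024, 2024))
results <- tibble(TEAM = c("B", "C", "D"), TEAMID = c(2, 3, 4))

# Naming the key silences the message and guards against accidental
# matches on other shared column names.
joined <- results %>% inner_join(picks, by = "TEAM")
joined
```

Only the two teams found in both tables are returned, with the columns of the left table first, followed by the extra columns of the right table.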

4. left_join

Describe the resulting data:

Columns: Team, Year, TeamID
Rows: 30

How is it different from the original two datasets?

Number of rows stayed the same as public_picks_small (30)
All columns from the two datasets; TeamID is NA for teams with no match in team_results_small

left_join(public_picks_small, team_results_small)

## Joining with `by = join_by(TEAM)`

## # A tibble: 30 × 3
##    TEAM              YEAR TEAMID
##    <chr>            <dbl>  <dbl>
##  1 Saint Mary's      2024     NA
##  2 McNeese St.       2024     NA
##  3 Wisconsin         2024     NA
##  4 South Dakota St.  2024     NA
##  5 Kansas            2024     NA
##  6 UAB               2024     NA
##  7 Oakland           2024     NA
##  8 Florida Atlantic  2024     NA
##  9 Iowa St.          2024     NA
## 10 Longwood          2024     NA
## # ℹ 20 more rows
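
One way to check how many rows of the left table found no match is to count the NA values in the column that came from the right table. The sketch below uses made-up tibbles (picks and results, hypothetical) and assumes TEAMID has no missing values before the join, so every NA afterwards marks an unmatched row.

```r
library(dplyr)

# Hypothetical toy tables; team "A" has no match in results.
picks   <- tibble(TEAM = c("A", "B", "C"), YEAR = c(2024, 2024, 2024))
results <- tibble(TEAM = c("B", "C", "D"), TEAMID = c(2, 3, 4))

# left_join keeps every row of picks and fills TEAMID with NA when unmatched.
joined <- left_join(picks, results, by = "TEAM")

# Number of left-table rows with no partner in the right table.
n_unmatched <- sum(is.na(joined$TEAMID))
n_unmatched
```

Applied to the samples in this report, the same count would be expected to equal the 30 rows minus the 4 matched teams.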

5. right_join

Describe the resulting data:

Columns: Team, Year, TeamID
Rows: 30

How is it different from the original two datasets?

Number of rows stayed the same as team_results_small (30)
All columns from the two datasets; Year is NA for teams with no match in public_picks_small

right_join(public_picks_small, team_results_small)

## Joining with `by = join_by(TEAM)`

## # A tibble: 30 × 3
##    TEAM                   YEAR TEAMID
##    <chr>                 <dbl>  <dbl>
##  1 Dayton                 2024     45
##  2 Morehead St.           2024    125
##  3 College of Charleston  2024     37
##  4 Creighton              2024     43
##  5 Lafayette                NA     92
##  6 Hampton                  NA     71
##  7 Davidson                 NA     44
##  8 Princeton                NA    165
##  9 Norfolk St.              NA    134
## 10 Xavier                   NA    244
## # ℹ 20 more rows

6. full_join

Describe the resulting data:

Columns: Team, Year, TeamID
Rows: 56

How is it different from the original two datasets?

Number of rows almost doubled: 56 compared to 30 in each original dataset (30 + 30 - 4 matched teams)
All columns from the two datasets

full_join(public_picks_small, team_results_small)

## Joining with `by = join_by(TEAM)`

## # A tibble: 56 × 3
##    TEAM              YEAR TEAMID
##    <chr>            <dbl>  <dbl>
##  1 Saint Mary's      2024     NA
##  2 McNeese St.       2024     NA
##  3 Wisconsin         2024     NA
##  4 South Dakota St.  2024     NA
##  5 Kansas            2024     NA
##  6 UAB               2024     NA
##  7 Oakland           2024     NA
##  8 Florida Atlantic  2024     NA
##  9 Iowa St.          2024     NA
## 10 Longwood          2024     NA
## # ℹ 46 more rows
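
The row count of a full join can be checked by arithmetic: rows in the first table, plus rows in the second, minus the rows that matched (here 30 + 30 - 4 = 56). The sketch below verifies that relationship on made-up tibbles (picks and results, hypothetical). It assumes each team appears at most once in each table; duplicated keys would break the formula.

```r
library(dplyr)

# Hypothetical toy tables with unique TEAM values.
picks   <- tibble(TEAM = c("A", "B", "C"), YEAR = c(2024, 2024, 2024))
results <- tibble(TEAM = c("B", "C", "D"), TEAMID = c(2, 3, 4))

n_full    <- nrow(full_join(picks, results, by = "TEAM"))
n_matched <- nrow(inner_join(picks, results, by = "TEAM"))

# Every row from either table appears once; matched rows are not double counted.
n_expected <- nrow(picks) + nrow(results) - n_matched
n_full == n_expected
```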

7. semi_join

Describe the resulting data:

Columns: Team, Year
Rows: 4

How is it different from the original two datasets?

4 rows compared to 30 in the original; only teams that also appear in team_results_small are kept
Only columns from the public_picks dataset

semi_join(public_picks_small, team_results_small)

## Joining with `by = join_by(TEAM)`

## # A tibble: 4 × 2
##   TEAM                   YEAR
##   <chr>                 <dbl>
## 1 Dayton                 2024
## 2 Morehead St.           2024
## 3 College of Charleston  2024
## 4 Creighton              2024

8. anti_join

Describe the resulting data:

Columns: Team, Year
Rows: 26

How is it different from the original two datasets?

Only 26 rows compared to 30; teams that also appear in team_results_small are dropped
Only columns from the public_picks dataset

anti_join(public_picks_small, team_results_small)

## Joining with `by = join_by(TEAM)`

## # A tibble: 26 × 2
##    TEAM              YEAR
##    <chr>            <dbl>
##  1 Saint Mary's      2024
##  2 McNeese St.       2024
##  3 Wisconsin         2024
##  4 South Dakota St.  2024
##  5 Kansas            2024
##  6 UAB               2024
##  7 Oakland           2024
##  8 Florida Atlantic  2024
##  9 Iowa St.          2024
## 10 Longwood          2024
## # ℹ 16 more rows
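
semi_join and anti_join are complements: each row of the first table lands in exactly one of the two results, which is why the 4 rows from semi_join and the 26 rows from anti_join add up to the original 30. A brief sketch of that check on made-up tibbles (picks and results, hypothetical):

```r
library(dplyr)

# Hypothetical toy tables.
picks   <- tibble(TEAM = c("A", "B", "C"), YEAR = c(2024, 2024, 2024))
results <- tibble(TEAM = c("B", "C", "D"), TEAMID = c(2, 3, 4))

kept    <- semi_join(picks, results, by = "TEAM")  # rows of picks with a match
dropped <- anti_join(picks, results, by = "TEAM")  # rows of picks without a match

# The two pieces partition picks, and neither adds columns from results.
nrow(kept) + nrow(dropped) == nrow(picks)
names(kept)
```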

Bella Kalinyak

2024-06-11
