Import two related datasets from TidyTuesday Project.
team_results <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-26/team-results.csv')
## Rows: 236 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): TEAM, F4PERCENT, CHAMPPERCENT
## dbl (17): TEAMID, PAKE, PAKERANK, PASE, PASERANK, GAMES, W, L, WINPERCENT, R...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
public_picks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-26/public-picks.csv')
## Rows: 64 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): TEAM, R64, R32, S16, E8, F4, FINALS
## dbl (2): YEAR, TEAMNO
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Describe the two datasets:
Data1: Team Results
Data 2: Public Pics
# Make team_results small with 20 random rows
team_results_small <- team_results %>%
sample_n(20) %>%
select(TEAM, GAMES, WINPERCENT)
# Make public_picks small with 20 random rows
public_picks_small <- public_picks %>%
sample_n(20) %>%
select(TEAM, R64, FINALS)
# Display the first few rows of each small dataset
team_results_small
## # A tibble: 20 × 3
## TEAM GAMES WINPERCENT
## <chr> <dbl> <dbl>
## 1 Illinois 10 0.4
## 2 Eastern Kentucky 1 0
## 3 Florida 28 0.679
## 4 Washington 9 0.556
## 5 Fairleigh Dickinson 3 0.333
## 6 Iona 7 0
## 7 Saint Mary's 13 0.385
## 8 Morgan St. 2 0
## 9 Miami FL 19 0.632
## 10 Indiana St. 1 0
## 11 Detroit 1 0
## 12 Nevada 5 0.4
## 13 Hampton 3 0
## 14 Penn 1 0
## 15 Louisville 31 0.71
## 16 Northern Colorado 1 0
## 17 Michigan St. 45 0.689
## 18 Iowa 10 0.4
## 19 Tulsa 1 0
## 20 Florida Atlantic 5 0.8
public_picks_small
## # A tibble: 20 × 3
## TEAM R64 FINALS
## <chr> <chr> <chr>
## 1 New Mexico 47.14% 0.10%
## 2 Nevada 54.15% 0.16%
## 3 Oakland 4.14% 0.04%
## 4 Saint Mary's 73.05% 0.29%
## 5 Western Kentucky 5.33% 0.04%
## 6 Florida 75.35% 0.33%
## 7 Oregon 47.53% 0.10%
## 8 Yale 9.26% 0.05%
## 9 Kansas 83.19% 0.91%
## 10 North Carolina St. 43.55% 0.07%
## 11 Drake 56.54% 0.18%
## 12 Kentucky 94.67% 3.48%
## 13 Houston 97.01% 9.27%
## 14 Grambling St. 1.73% 0.03%
## 15 Stetson 1.50% 0.03%
## 16 Longwood 2.11% 0.03%
## 17 Duquesne 24.05% 0.05%
## 18 Samford 15.29% 0.05%
## 19 Arizona 95.19% 4.06%
## 20 Long Beach St. 3.71% 0.03%
Description:
Columns: The resulting dataset will have all columns from both datasets (TEAM, GAMES, WINPERCENT from team_results_small, and R64, FINALS from public_picks_small).
Rows: Only rows where the TEAM exists in both datasets will be included.
Difference: *The inner join only keeps the intersection of both datasets. This means the resulting data has fewer rows than either dataset if not all teams match.
# Perform inner join
inner_joined_data <- inner_join(team_results_small, public_picks_small, by = "TEAM")
inner_joined_data
## # A tibble: 3 × 5
## TEAM GAMES WINPERCENT R64 FINALS
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Florida 28 0.679 75.35% 0.33%
## 2 Saint Mary's 13 0.385 73.05% 0.29%
## 3 Nevada 5 0.4 54.15% 0.16%
Description: Columns: The resulting dataset will include all columns from both datasets.
Rows: All rows from team_results_small are included, and matching rows from public_picks_small. Non-matching rows in public_picks_small will have NA for the missing data.
Difference: *This retains all rows from team_results_small. If a TEAM in team_results_small has no match in public_picks_small, it will still be included with NA values for the missing columns.
# Perform left join
left_joined_data <- left_join(team_results_small, public_picks_small, by = "TEAM")
left_joined_data
## # A tibble: 20 × 5
## TEAM GAMES WINPERCENT R64 FINALS
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Illinois 10 0.4 <NA> <NA>
## 2 Eastern Kentucky 1 0 <NA> <NA>
## 3 Florida 28 0.679 75.35% 0.33%
## 4 Washington 9 0.556 <NA> <NA>
## 5 Fairleigh Dickinson 3 0.333 <NA> <NA>
## 6 Iona 7 0 <NA> <NA>
## 7 Saint Mary's 13 0.385 73.05% 0.29%
## 8 Morgan St. 2 0 <NA> <NA>
## 9 Miami FL 19 0.632 <NA> <NA>
## 10 Indiana St. 1 0 <NA> <NA>
## 11 Detroit 1 0 <NA> <NA>
## 12 Nevada 5 0.4 54.15% 0.16%
## 13 Hampton 3 0 <NA> <NA>
## 14 Penn 1 0 <NA> <NA>
## 15 Louisville 31 0.71 <NA> <NA>
## 16 Northern Colorado 1 0 <NA> <NA>
## 17 Michigan St. 45 0.689 <NA> <NA>
## 18 Iowa 10 0.4 <NA> <NA>
## 19 Tulsa 1 0 <NA> <NA>
## 20 Florida Atlantic 5 0.8 <NA> <NA>
Description: Columns: Includes all columns from both datasets. Rows: All rows from public_picks_small are included, with matching rows from team_results_small. Non-matching rows in team_results_small will have NA for the missing data.
Difference: *This keeps all rows from public_picks_small. If a TEAM in public_picks_small does not appear in team_results_small, it will still be included with NA for the missing columns.
# Perform right join
right_joined_data <- right_join(team_results_small, public_picks_small, by = "TEAM")
right_joined_data
## # A tibble: 20 × 5
## TEAM GAMES WINPERCENT R64 FINALS
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Florida 28 0.679 75.35% 0.33%
## 2 Saint Mary's 13 0.385 73.05% 0.29%
## 3 Nevada 5 0.4 54.15% 0.16%
## 4 New Mexico NA NA 47.14% 0.10%
## 5 Oakland NA NA 4.14% 0.04%
## 6 Western Kentucky NA NA 5.33% 0.04%
## 7 Oregon NA NA 47.53% 0.10%
## 8 Yale NA NA 9.26% 0.05%
## 9 Kansas NA NA 83.19% 0.91%
## 10 North Carolina St. NA NA 43.55% 0.07%
## 11 Drake NA NA 56.54% 0.18%
## 12 Kentucky NA NA 94.67% 3.48%
## 13 Houston NA NA 97.01% 9.27%
## 14 Grambling St. NA NA 1.73% 0.03%
## 15 Stetson NA NA 1.50% 0.03%
## 16 Longwood NA NA 2.11% 0.03%
## 17 Duquesne NA NA 24.05% 0.05%
## 18 Samford NA NA 15.29% 0.05%
## 19 Arizona NA NA 95.19% 4.06%
## 20 Long Beach St. NA NA 3.71% 0.03%
Description: Columns: All columns from both datasets. Rows: All rows from both datasets are included. Teams that don’t match will have NA values for the columns from the other dataset.
Difference: *The full join combines all rows from both datasets. If a TEAM is only in one dataset, it still appears in the result with NA for the columns from the other dataset.
# Perform full join
full_joined_data <- full_join(team_results_small, public_picks_small, by = "TEAM")
full_joined_data
## # A tibble: 37 × 5
## TEAM GAMES WINPERCENT R64 FINALS
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Illinois 10 0.4 <NA> <NA>
## 2 Eastern Kentucky 1 0 <NA> <NA>
## 3 Florida 28 0.679 75.35% 0.33%
## 4 Washington 9 0.556 <NA> <NA>
## 5 Fairleigh Dickinson 3 0.333 <NA> <NA>
## 6 Iona 7 0 <NA> <NA>
## 7 Saint Mary's 13 0.385 73.05% 0.29%
## 8 Morgan St. 2 0 <NA> <NA>
## 9 Miami FL 19 0.632 <NA> <NA>
## 10 Indiana St. 1 0 <NA> <NA>
## # ℹ 27 more rows
Description: Columns: Only columns from team_results_small. Rows: Only rows where TEAM exists in both datasets are included.
Difference: *A semi join returns only the rows from team_results_small that have matches in public_picks_small, without adding any columns from public_picks_small.
# Perform semi join
semi_joined_data <- semi_join(team_results_small, public_picks_small, by = "TEAM")
semi_joined_data
## # A tibble: 3 × 3
## TEAM GAMES WINPERCENT
## <chr> <dbl> <dbl>
## 1 Florida 28 0.679
## 2 Saint Mary's 13 0.385
## 3 Nevada 5 0.4
Description: Columns: Only columns from team_results_small. Rows: Only rows from team_results_small that do not have matches in public_picks_small.
Difference: *The anti join returns rows from team_results_small that have no match in public_picks_small. This shows the teams that are missing from the other dataset.
# Perform anti join
anti_joined_data <- anti_join(team_results_small, public_picks_small, by = "TEAM")
anti_joined_data
## # A tibble: 17 × 3
## TEAM GAMES WINPERCENT
## <chr> <dbl> <dbl>
## 1 Illinois 10 0.4
## 2 Eastern Kentucky 1 0
## 3 Washington 9 0.556
## 4 Fairleigh Dickinson 3 0.333
## 5 Iona 7 0
## 6 Morgan St. 2 0
## 7 Miami FL 19 0.632
## 8 Indiana St. 1 0
## 9 Detroit 1 0
## 10 Hampton 3 0
## 11 Penn 1 0
## 12 Louisville 31 0.71
## 13 Northern Colorado 1 0
## 14 Michigan St. 45 0.689
## 15 Iowa 10 0.4
## 16 Tulsa 1 0
## 17 Florida Atlantic 5 0.8