Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from TidyTuesday Project.

team_results <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-26/team-results.csv')

## Rows: 236 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): TEAM, F4PERCENT, CHAMPPERCENT
## dbl (17): TEAMID, PAKE, PAKERANK, PASE, PASERANK, GAMES, W, L, WINPERCENT, R...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

public_picks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-26/public-picks.csv')

## Rows: 64 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): TEAM, R64, R32, S16, E8, F4, FINALS
## dbl (2): YEAR, TEAMNO
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data1: Team Results

Columns: Team/School, Games Played, Win Percentage
Rows: 20 Random Rows

Data 2: Public Pics

Columns: Team/School, Round of 64 percent of picks, Finals percent of picks
Rows: 20 Random Rows

# Make team_results small with 20 random rows
team_results_small <- team_results %>%
  sample_n(20) %>%
  select(TEAM, GAMES, WINPERCENT)

# Make public_picks small with 20 random rows
public_picks_small <- public_picks %>%
  sample_n(20) %>%
  select(TEAM, R64, FINALS)

# Display the first few rows of each small dataset
team_results_small

## # A tibble: 20 × 3
##    TEAM                GAMES WINPERCENT
##    <chr>               <dbl>      <dbl>
##  1 Illinois               10      0.4  
##  2 Eastern Kentucky        1      0    
##  3 Florida                28      0.679
##  4 Washington              9      0.556
##  5 Fairleigh Dickinson     3      0.333
##  6 Iona                    7      0    
##  7 Saint Mary's           13      0.385
##  8 Morgan St.              2      0    
##  9 Miami FL               19      0.632
## 10 Indiana St.             1      0    
## 11 Detroit                 1      0    
## 12 Nevada                  5      0.4  
## 13 Hampton                 3      0    
## 14 Penn                    1      0    
## 15 Louisville             31      0.71 
## 16 Northern Colorado       1      0    
## 17 Michigan St.           45      0.689
## 18 Iowa                   10      0.4  
## 19 Tulsa                   1      0    
## 20 Florida Atlantic        5      0.8

public_picks_small

## # A tibble: 20 × 3
##    TEAM               R64    FINALS
##    <chr>              <chr>  <chr> 
##  1 New Mexico         47.14% 0.10% 
##  2 Nevada             54.15% 0.16% 
##  3 Oakland            4.14%  0.04% 
##  4 Saint Mary's       73.05% 0.29% 
##  5 Western Kentucky   5.33%  0.04% 
##  6 Florida            75.35% 0.33% 
##  7 Oregon             47.53% 0.10% 
##  8 Yale               9.26%  0.05% 
##  9 Kansas             83.19% 0.91% 
## 10 North Carolina St. 43.55% 0.07% 
## 11 Drake              56.54% 0.18% 
## 12 Kentucky           94.67% 3.48% 
## 13 Houston            97.01% 9.27% 
## 14 Grambling St.      1.73%  0.03% 
## 15 Stetson            1.50%  0.03% 
## 16 Longwood           2.11%  0.03% 
## 17 Duquesne           24.05% 0.05% 
## 18 Samford            15.29% 0.05% 
## 19 Arizona            95.19% 4.06% 
## 20 Long Beach St.     3.71%  0.03%

Description:

Columns: The resulting dataset will have all columns from both datasets (TEAM, GAMES, WINPERCENT from team_results_small, and R64, FINALS from public_picks_small).

Rows: Only rows where the TEAM exists in both datasets will be included.

Difference: *The inner join only keeps the intersection of both datasets. This means the resulting data has fewer rows than either dataset if not all teams match.

# Perform inner join
inner_joined_data <- inner_join(team_results_small, public_picks_small, by = "TEAM")

inner_joined_data

## # A tibble: 3 × 5
##   TEAM         GAMES WINPERCENT R64    FINALS
##   <chr>        <dbl>      <dbl> <chr>  <chr> 
## 1 Florida         28      0.679 75.35% 0.33% 
## 2 Saint Mary's    13      0.385 73.05% 0.29% 
## 3 Nevada           5      0.4   54.15% 0.16%

4. left_join

Description: Columns: The resulting dataset will include all columns from both datasets.

Rows: All rows from team_results_small are included, and matching rows from public_picks_small. Non-matching rows in public_picks_small will have NA for the missing data.

Difference: *This retains all rows from team_results_small. If a TEAM in team_results_small has no match in public_picks_small, it will still be included with NA values for the missing columns.

# Perform left join
left_joined_data <- left_join(team_results_small, public_picks_small, by = "TEAM")

left_joined_data

## # A tibble: 20 × 5
##    TEAM                GAMES WINPERCENT R64    FINALS
##    <chr>               <dbl>      <dbl> <chr>  <chr> 
##  1 Illinois               10      0.4   <NA>   <NA>  
##  2 Eastern Kentucky        1      0     <NA>   <NA>  
##  3 Florida                28      0.679 75.35% 0.33% 
##  4 Washington              9      0.556 <NA>   <NA>  
##  5 Fairleigh Dickinson     3      0.333 <NA>   <NA>  
##  6 Iona                    7      0     <NA>   <NA>  
##  7 Saint Mary's           13      0.385 73.05% 0.29% 
##  8 Morgan St.              2      0     <NA>   <NA>  
##  9 Miami FL               19      0.632 <NA>   <NA>  
## 10 Indiana St.             1      0     <NA>   <NA>  
## 11 Detroit                 1      0     <NA>   <NA>  
## 12 Nevada                  5      0.4   54.15% 0.16% 
## 13 Hampton                 3      0     <NA>   <NA>  
## 14 Penn                    1      0     <NA>   <NA>  
## 15 Louisville             31      0.71  <NA>   <NA>  
## 16 Northern Colorado       1      0     <NA>   <NA>  
## 17 Michigan St.           45      0.689 <NA>   <NA>  
## 18 Iowa                   10      0.4   <NA>   <NA>  
## 19 Tulsa                   1      0     <NA>   <NA>  
## 20 Florida Atlantic        5      0.8   <NA>   <NA>

5. right_join

Description: Columns: Includes all columns from both datasets. Rows: All rows from public_picks_small are included, with matching rows from team_results_small. Non-matching rows in team_results_small will have NA for the missing data.

Difference: *This keeps all rows from public_picks_small. If a TEAM in public_picks_small does not appear in team_results_small, it will still be included with NA for the missing columns.

# Perform right join
right_joined_data <- right_join(team_results_small, public_picks_small, by = "TEAM")

right_joined_data

## # A tibble: 20 × 5
##    TEAM               GAMES WINPERCENT R64    FINALS
##    <chr>              <dbl>      <dbl> <chr>  <chr> 
##  1 Florida               28      0.679 75.35% 0.33% 
##  2 Saint Mary's          13      0.385 73.05% 0.29% 
##  3 Nevada                 5      0.4   54.15% 0.16% 
##  4 New Mexico            NA     NA     47.14% 0.10% 
##  5 Oakland               NA     NA     4.14%  0.04% 
##  6 Western Kentucky      NA     NA     5.33%  0.04% 
##  7 Oregon                NA     NA     47.53% 0.10% 
##  8 Yale                  NA     NA     9.26%  0.05% 
##  9 Kansas                NA     NA     83.19% 0.91% 
## 10 North Carolina St.    NA     NA     43.55% 0.07% 
## 11 Drake                 NA     NA     56.54% 0.18% 
## 12 Kentucky              NA     NA     94.67% 3.48% 
## 13 Houston               NA     NA     97.01% 9.27% 
## 14 Grambling St.         NA     NA     1.73%  0.03% 
## 15 Stetson               NA     NA     1.50%  0.03% 
## 16 Longwood              NA     NA     2.11%  0.03% 
## 17 Duquesne              NA     NA     24.05% 0.05% 
## 18 Samford               NA     NA     15.29% 0.05% 
## 19 Arizona               NA     NA     95.19% 4.06% 
## 20 Long Beach St.        NA     NA     3.71%  0.03%

6. full_join

Description: Columns: All columns from both datasets. Rows: All rows from both datasets are included. Teams that don’t match will have NA values for the columns from the other dataset.

Difference: *The full join combines all rows from both datasets. If a TEAM is only in one dataset, it still appears in the result with NA for the columns from the other dataset.

# Perform full join
full_joined_data <- full_join(team_results_small, public_picks_small, by = "TEAM")

full_joined_data

## # A tibble: 37 × 5
##    TEAM                GAMES WINPERCENT R64    FINALS
##    <chr>               <dbl>      <dbl> <chr>  <chr> 
##  1 Illinois               10      0.4   <NA>   <NA>  
##  2 Eastern Kentucky        1      0     <NA>   <NA>  
##  3 Florida                28      0.679 75.35% 0.33% 
##  4 Washington              9      0.556 <NA>   <NA>  
##  5 Fairleigh Dickinson     3      0.333 <NA>   <NA>  
##  6 Iona                    7      0     <NA>   <NA>  
##  7 Saint Mary's           13      0.385 73.05% 0.29% 
##  8 Morgan St.              2      0     <NA>   <NA>  
##  9 Miami FL               19      0.632 <NA>   <NA>  
## 10 Indiana St.             1      0     <NA>   <NA>  
## # ℹ 27 more rows

7. semi_join

Description: Columns: Only columns from team_results_small. Rows: Only rows where TEAM exists in both datasets are included.

Difference: *A semi join returns only the rows from team_results_small that have matches in public_picks_small, without adding any columns from public_picks_small.

# Perform semi join
semi_joined_data <- semi_join(team_results_small, public_picks_small, by = "TEAM")

semi_joined_data

## # A tibble: 3 × 3
##   TEAM         GAMES WINPERCENT
##   <chr>        <dbl>      <dbl>
## 1 Florida         28      0.679
## 2 Saint Mary's    13      0.385
## 3 Nevada           5      0.4

8. anti_join

Description: Columns: Only columns from team_results_small. Rows: Only rows from team_results_small that do not have matches in public_picks_small.

Difference: *The anti join returns rows from team_results_small that have no match in public_picks_small. This shows the teams that are missing from the other dataset.

# Perform anti join
anti_joined_data <- anti_join(team_results_small, public_picks_small, by = "TEAM")

anti_joined_data

## # A tibble: 17 × 3
##    TEAM                GAMES WINPERCENT
##    <chr>               <dbl>      <dbl>
##  1 Illinois               10      0.4  
##  2 Eastern Kentucky        1      0    
##  3 Washington              9      0.556
##  4 Fairleigh Dickinson     3      0.333
##  5 Iona                    7      0    
##  6 Morgan St.              2      0    
##  7 Miami FL               19      0.632
##  8 Indiana St.             1      0    
##  9 Detroit                 1      0    
## 10 Hampton                 3      0    
## 11 Penn                    1      0    
## 12 Louisville             31      0.71 
## 13 Northern Colorado       1      0    
## 14 Michigan St.           45      0.689
## 15 Iowa                   10      0.4  
## 16 Tulsa                   1      0    
## 17 Florida Atlantic        5      0.8