Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

cbp_resp <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-11-26/cbp_resp.csv')

## Rows: 68815 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): month_grouping, month_abbv, component, land_border_region, area_of...
## dbl  (2): fiscal_year, encounter_count
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cbp_state <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-11-26/cbp_state.csv')

## Rows: 54939 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): month_grouping, month_abbv, land_border_region, state, demographic,...
## dbl (2): fiscal_year, encounter_count
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
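The messages above suggest two ways to quiet the column-specification output. Here is a minimal sketch of both, assuming readr 2.0 or later (needed for literal data wrapped in `I()`). The inline CSV text is a hypothetical two-row stand-in for the remote file, so the example runs without a network connection.

```r
library(readr)

# Hypothetical inline CSV standing in for the remote file (not the real data).
csv_text <- I("fiscal_year,citizenship,encounter_count\n2024,MEXICO,453\n2023,CUBA,3\n")

# Option 1: keep the guessed column types but silence the message.
quiet <- read_csv(csv_text, show_col_types = FALSE)

# Option 2: declare the column types explicitly, which also silences the message.
typed <- read_csv(
  csv_text,
  col_types = cols(
    fiscal_year     = col_double(),
    citizenship     = col_character(),
    encounter_count = col_double()
  )
)
```

The same arguments can be passed to the two `readr::read_csv()` calls above.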

2. Make data small

Describe the two datasets

Data 1: cbp_resp (reduced to cbp_resp_small)

Columns: fiscal year, land border region, citizenship, encounter count
Rows: 10

Data 2: cbp_state (reduced to cbp_state_small)

Columns: fiscal year, land border region, citizenship, state
Rows: 10

library(dplyr)  # provides %>%, select(), sample_n(), and the joins
set.seed(1234)  # makes the random samples reproducible
cbp_resp_small <- cbp_resp %>%
  select(fiscal_year, land_border_region, citizenship, encounter_count) %>%
  sample_n(10)
cbp_state_small <- cbp_state %>%
  select(fiscal_year, land_border_region, citizenship, state) %>%
  sample_n(10)


cbp_resp_small

## # A tibble: 10 × 4
##    fiscal_year land_border_region    citizenship encounter_count
##          <dbl> <chr>                 <chr>                 <dbl>
##  1        2023 Other                 ROMANIA                  14
##  2        2021 Northern Land Border  COLOMBIA                  4
##  3        2022 Southwest Land Border HONDURAS                  4
##  4        2024 Other                 UKRAINE                   6
##  5        2024 Southwest Land Border MEXICO                  453
##  6        2024 Southwest Land Border HAITI                     1
##  7        2021 Southwest Land Border EL SALVADOR              27
##  8        2022 Northern Land Border  EL SALVADOR               1
##  9        2023 Southwest Land Border CUBA                      3
## 10        2021 Northern Land Border  MEXICO                    1

cbp_state_small

## # A tibble: 10 × 4
##    fiscal_year land_border_region    citizenship state
##          <dbl> <chr>                 <chr>       <chr>
##  1        2024 Southwest Land Border MEXICO      TX   
##  2        2023 Southwest Land Border NICARAGUA   AZ   
##  3        2023 Other                 NICARAGUA   CA   
##  4        2023 Other                 UKRAINE     DE   
##  5        2023 Other                 MEXICO      GA   
##  6        2024 Other                 MEXICO      DC   
##  7        2022 Southwest Land Border HAITI       CA   
##  8        2023 Other                 MEXICO      MD   
##  9        2023 Other                 OTHER       KY   
## 10        2024 Other                 VENEZUELA   IL
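When no `by` argument is given, the joins in the following sections match on every column name the two tables share. Here is a small base R sketch of how to check that in advance. The column names are copied from the printed tibbles above; the variable names are only for illustration.

```r
# Column names of the two small tables, copied from the printed tibbles above.
resp_cols  <- c("fiscal_year", "land_border_region", "citizenship", "encounter_count")
state_cols <- c("fiscal_year", "land_border_region", "citizenship", "state")

# Columns present in both tables: these become the join keys by default.
shared_cols <- intersect(state_cols, resp_cols)

# Columns unique to each table: these are carried along, not matched on.
only_state <- setdiff(state_cols, resp_cols)
only_resp  <- setdiff(resp_cols, state_cols)
```

With the real objects, `intersect(names(cbp_state_small), names(cbp_resp_small))` should give the same three key columns.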

3. inner_join

Describe the resulting data:

Columns: fiscal year, land border region, citizenship, state, encounter count
Rows: 1

How is it different from the original two datasets?

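There are fewer rows: each original dataset had 10, but only one row matches on all three shared columns (fiscal year, land border region, citizenship).
There are five columns: the three shared columns plus state from cbp_state_small and encounter count from cbp_resp_small.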

inner_join(cbp_state_small,cbp_resp_small)

## Joining with `by = join_by(fiscal_year, land_border_region, citizenship)`

## # A tibble: 1 × 5
##   fiscal_year land_border_region    citizenship state encounter_count
##         <dbl> <chr>                 <chr>       <chr>           <dbl>
## 1        2024 Southwest Land Border MEXICO      TX                453
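The "Joining with" message shows that dplyr picked the key columns itself. Below is a sketch of the same inner join with the keys written out, assuming dplyr 1.1.0 or later for `join_by()`. The tables `state_demo` and `resp_demo` are three-row subsets copied from the printed tibbles above, used only for illustration.

```r
library(dplyr)
library(tibble)

# Three-row subsets of the small tables shown above (illustration only).
state_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~state,
  2024,         "Southwest Land Border", "MEXICO",     "TX",
  2023,         "Southwest Land Border", "NICARAGUA",  "AZ",
  2023,         "Other",                 "UKRAINE",    "DE"
)
resp_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~encounter_count,
  2024,         "Southwest Land Border", "MEXICO",     453,
  2023,         "Other",                 "ROMANIA",    14,
  2024,         "Other",                 "UKRAINE",    6
)

# Naming the keys explicitly avoids the message and documents the intent.
joined <- inner_join(
  state_demo, resp_demo,
  by = join_by(fiscal_year, land_border_region, citizenship)
)
```

The two UKRAINE rows agree on region and citizenship but differ in fiscal year, so they do not match. A row is kept only when all three keys agree.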

4. left_join

Describe the resulting data:

Columns: fiscal year, land border region, citizenship, state, encounter count
Rows: 10

How is it different from the original two datasets?

It has five columns instead of four: all four columns of cbp_state_small plus encounter count from cbp_resp_small.
All 10 rows of cbp_state_small are kept. Encounter count is NA in nine of them because those rows have no match in cbp_resp_small on fiscal year, land border region, and citizenship.

left_join(cbp_state_small, cbp_resp_small)

## Joining with `by = join_by(fiscal_year, land_border_region, citizenship)`

## # A tibble: 10 × 5
##    fiscal_year land_border_region    citizenship state encounter_count
##          <dbl> <chr>                 <chr>       <chr>           <dbl>
##  1        2024 Southwest Land Border MEXICO      TX                453
##  2        2023 Southwest Land Border NICARAGUA   AZ                 NA
##  3        2023 Other                 NICARAGUA   CA                 NA
##  4        2023 Other                 UKRAINE     DE                 NA
##  5        2023 Other                 MEXICO      GA                 NA
##  6        2024 Other                 MEXICO      DC                 NA
##  7        2022 Southwest Land Border HAITI       CA                 NA
##  8        2023 Other                 MEXICO      MD                 NA
##  9        2023 Other                 OTHER       KY                 NA
## 10        2024 Other                 VENEZUELA   IL                 NA
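One way to confirm where the NA values come from is to count them after the join. This is a sketch on three-row subsets copied from the printed tibbles above; `state_demo`, `resp_demo`, and `keys` are illustrative names, not objects from the original analysis.

```r
library(dplyr)
library(tibble)

# Three-row subsets of the small tables shown above (illustration only).
state_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~state,
  2024,         "Southwest Land Border", "MEXICO",     "TX",
  2023,         "Southwest Land Border", "NICARAGUA",  "AZ",
  2023,         "Other",                 "UKRAINE",    "DE"
)
resp_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~encounter_count,
  2024,         "Southwest Land Border", "MEXICO",     453,
  2023,         "Other",                 "ROMANIA",    14,
  2024,         "Other",                 "UKRAINE",    6
)

keys <- c("fiscal_year", "land_border_region", "citizenship")
left <- left_join(state_demo, resp_demo, by = keys)

# Rows of state_demo with no partner in resp_demo get NA for encounter_count.
n_unmatched <- sum(is.na(left$encounter_count))
```

Here every row of `state_demo` survives, and `n_unmatched` counts the rows that found no partner.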

5. right_join

Describe the resulting data:

Columns: fiscal year, land border region, citizenship, state, encounter count
Rows: 10

How is it different from the original two datasets?

It has five columns instead of four: all four columns of cbp_resp_small plus state from cbp_state_small.
All 10 rows of cbp_resp_small are kept. State is NA in nine of them because those rows have no match in cbp_state_small on fiscal year, land border region, and citizenship.

right_join(cbp_state_small,cbp_resp_small)

## Joining with `by = join_by(fiscal_year, land_border_region, citizenship)`

## # A tibble: 10 × 5
##    fiscal_year land_border_region    citizenship state encounter_count
##          <dbl> <chr>                 <chr>       <chr>           <dbl>
##  1        2024 Southwest Land Border MEXICO      TX                453
##  2        2023 Other                 ROMANIA     <NA>               14
##  3        2021 Northern Land Border  COLOMBIA    <NA>                4
##  4        2022 Southwest Land Border HONDURAS    <NA>                4
##  5        2024 Other                 UKRAINE     <NA>                6
##  6        2024 Southwest Land Border HAITI       <NA>                1
##  7        2021 Southwest Land Border EL SALVADOR <NA>               27
##  8        2022 Northern Land Border  EL SALVADOR <NA>                1
##  9        2023 Southwest Land Border CUBA        <NA>                3
## 10        2021 Northern Land Border  MEXICO      <NA>                1
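A right join can be read as a left join with the two tables swapped. The sketch below checks this on three-row subsets copied from the printed tibbles above (illustrative names, not the original objects). Column order and row order can differ between the two forms, so both are aligned before comparing.

```r
library(dplyr)
library(tibble)

# Three-row subsets of the small tables shown above (illustration only).
state_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~state,
  2024,         "Southwest Land Border", "MEXICO",     "TX",
  2023,         "Southwest Land Border", "NICARAGUA",  "AZ",
  2023,         "Other",                 "UKRAINE",    "DE"
)
resp_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~encounter_count,
  2024,         "Southwest Land Border", "MEXICO",     453,
  2023,         "Other",                 "ROMANIA",    14,
  2024,         "Other",                 "UKRAINE",    6
)

keys <- c("fiscal_year", "land_border_region", "citizenship")

via_right <- right_join(state_demo, resp_demo, by = keys)
via_left  <- left_join(resp_demo, state_demo, by = keys) %>%
  select(all_of(names(via_right)))  # align column order with via_right

# Sort both results the same way, then compare their contents.
same_rows <- isTRUE(all.equal(
  as.data.frame(arrange(via_right, citizenship)),
  as.data.frame(arrange(via_left, citizenship))
))
```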

6. full_join

Describe the resulting data:

Columns: fiscal year, land border region, citizenship, state, encounter count
Rows: 19

How is it different from the original two datasets?

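There are 19 rows instead of 10: the one matched row, plus nine unmatched rows from each dataset.
There are five columns instead of four.
Encounter count is NA for the rows found only in cbp_state_small, and state is NA for the rows found only in cbp_resp_small.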

full_join(cbp_state_small,cbp_resp_small)

## Joining with `by = join_by(fiscal_year, land_border_region, citizenship)`

## # A tibble: 19 × 5
##    fiscal_year land_border_region    citizenship state encounter_count
##          <dbl> <chr>                 <chr>       <chr>           <dbl>
##  1        2024 Southwest Land Border MEXICO      TX                453
##  2        2023 Southwest Land Border NICARAGUA   AZ                 NA
##  3        2023 Other                 NICARAGUA   CA                 NA
##  4        2023 Other                 UKRAINE     DE                 NA
##  5        2023 Other                 MEXICO      GA                 NA
##  6        2024 Other                 MEXICO      DC                 NA
##  7        2022 Southwest Land Border HAITI       CA                 NA
##  8        2023 Other                 MEXICO      MD                 NA
##  9        2023 Other                 OTHER       KY                 NA
## 10        2024 Other                 VENEZUELA   IL                 NA
## 11        2023 Other                 ROMANIA     <NA>               14
## 12        2021 Northern Land Border  COLOMBIA    <NA>                4
## 13        2022 Southwest Land Border HONDURAS    <NA>                4
## 14        2024 Other                 UKRAINE     <NA>                6
## 15        2024 Southwest Land Border HAITI       <NA>                1
## 16        2021 Southwest Land Border EL SALVADOR <NA>               27
## 17        2022 Northern Land Border  EL SALVADOR <NA>                1
## 18        2023 Southwest Land Border CUBA        <NA>                3
## 19        2021 Northern Land Border  MEXICO      <NA>                1
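The count of 19 can be checked with a small identity: a full join contains the matched rows plus the unmatched rows from each side. Below is a sketch that rebuilds both small tables from the printed tibbles above (`state_small` and `resp_small` are stand-in names for this example) and verifies 1 + 9 + 9 = 19.

```r
library(dplyr)
library(tibble)

# Both small tables, retyped from the printed output above.
state_small <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~state,
  2024, "Southwest Land Border", "MEXICO",    "TX",
  2023, "Southwest Land Border", "NICARAGUA", "AZ",
  2023, "Other",                 "NICARAGUA", "CA",
  2023, "Other",                 "UKRAINE",   "DE",
  2023, "Other",                 "MEXICO",    "GA",
  2024, "Other",                 "MEXICO",    "DC",
  2022, "Southwest Land Border", "HAITI",     "CA",
  2023, "Other",                 "MEXICO",    "MD",
  2023, "Other",                 "OTHER",     "KY",
  2024, "Other",                 "VENEZUELA", "IL"
)
resp_small <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship,  ~encounter_count,
  2023, "Other",                 "ROMANIA",     14,
  2021, "Northern Land Border",  "COLOMBIA",    4,
  2022, "Southwest Land Border", "HONDURAS",    4,
  2024, "Other",                 "UKRAINE",     6,
  2024, "Southwest Land Border", "MEXICO",      453,
  2024, "Southwest Land Border", "HAITI",       1,
  2021, "Southwest Land Border", "EL SALVADOR", 27,
  2022, "Northern Land Border",  "EL SALVADOR", 1,
  2023, "Southwest Land Border", "CUBA",        3,
  2021, "Northern Land Border",  "MEXICO",      1
)

keys <- c("fiscal_year", "land_border_region", "citizenship")

n_matched    <- nrow(inner_join(state_small, resp_small, by = keys))
n_only_state <- nrow(anti_join(state_small, resp_small, by = keys))
n_only_resp  <- nrow(anti_join(resp_small, state_small, by = keys))
n_full       <- nrow(full_join(state_small, resp_small, by = keys))

# Matched rows + rows only in state_small + rows only in resp_small.
n_expected <- n_matched + n_only_state + n_only_resp
```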

7. semi_join

Describe the resulting data:

Columns: fiscal year, land border region, citizenship, state
Rows: 1

How is it different from the original two datasets?

It has the same four columns as cbp_state_small; encounter count from cbp_resp_small is not added.
There is only one row: the single cbp_state_small row that has a match in cbp_resp_small.

semi_join(cbp_state_small,cbp_resp_small)

## Joining with `by = join_by(fiscal_year, land_border_region, citizenship)`

## # A tibble: 1 × 4
##   fiscal_year land_border_region    citizenship state
##         <dbl> <chr>                 <chr>       <chr>
## 1        2024 Southwest Land Border MEXICO      TX
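In the output above, semi_join looks like inner_join with one column dropped. The two differ when a key appears more than once in the second table. The sketch below uses hypothetical data to show this: the second `resp_demo` row and its count of 12 are invented for illustration and are not in the samples above.

```r
library(dplyr)
library(tibble)

state_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~state,
  2024,         "Southwest Land Border", "MEXICO",     "TX"
)
# Hypothetical: the same key appears twice in the second table.
resp_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~encounter_count,
  2024,         "Southwest Land Border", "MEXICO",     453,
  2024,         "Southwest Land Border", "MEXICO",     12
)

keys <- c("fiscal_year", "land_border_region", "citizenship")

# inner_join returns one output row per matching pair, so the TX row is duplicated.
inner <- inner_join(state_demo, resp_demo, by = keys)

# semi_join only filters state_demo: each row appears at most once, no new columns.
semi <- semi_join(state_demo, resp_demo, by = keys)
```

This is why semi_join is described as a filtering join: it answers "which rows of the first table have a match?" without changing their shape.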

8. anti_join

Describe the resulting data:

Columns: fiscal year, land border region, citizenship, state
Rows: 9

How is it different from the original two datasets?

There are nine rows instead of 10: the cbp_state_small rows that have no match in cbp_resp_small.
Encounter count, a variable of cbp_resp_small, is not included.

anti_join(cbp_state_small,cbp_resp_small)

## Joining with `by = join_by(fiscal_year, land_border_region, citizenship)`

## # A tibble: 9 × 4
##   fiscal_year land_border_region    citizenship state
##         <dbl> <chr>                 <chr>       <chr>
## 1        2023 Southwest Land Border NICARAGUA   AZ   
## 2        2023 Other                 NICARAGUA   CA   
## 3        2023 Other                 UKRAINE     DE   
## 4        2023 Other                 MEXICO      GA   
## 5        2024 Other                 MEXICO      DC   
## 6        2022 Southwest Land Border HAITI       CA   
## 7        2023 Other                 MEXICO      MD   
## 8        2023 Other                 OTHER       KY   
## 9        2024 Other                 VENEZUELA   IL
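semi_join and anti_join split the first table into two non-overlapping parts: rows with a match and rows without one. Here is a sketch on three-row subsets copied from the printed tibbles above (illustrative names, not the original objects).

```r
library(dplyr)
library(tibble)

# Three-row subsets of the small tables shown above (illustration only).
state_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~state,
  2024,         "Southwest Land Border", "MEXICO",     "TX",
  2023,         "Southwest Land Border", "NICARAGUA",  "AZ",
  2023,         "Other",                 "UKRAINE",    "DE"
)
resp_demo <- tribble(
  ~fiscal_year, ~land_border_region,     ~citizenship, ~encounter_count,
  2024,         "Southwest Land Border", "MEXICO",     453,
  2023,         "Other",                 "ROMANIA",    14,
  2024,         "Other",                 "UKRAINE",    6
)

keys <- c("fiscal_year", "land_border_region", "citizenship")

matched   <- semi_join(state_demo, resp_demo, by = keys)  # rows with a match
unmatched <- anti_join(state_demo, resp_demo, by = keys)  # rows without a match

# Together the two pieces account for every row of state_demo exactly once.
n_total <- nrow(matched) + nrow(unmatched)
```

On the full 10-row samples the same split gives 1 + 9 = 10, matching the semi_join and anti_join results above.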

Madeleine Lorenz

2025-3-21
