Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-11/colony.csv')

## Rows: 1222 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): months, state
## dbl (8): year, colony_n, colony_max, colony_lost, colony_lost_pct, colony_ad...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

stressor <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-11/stressor.csv')

## Rows: 7332 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): months, state, stressor
## dbl (2): year, stress_pct
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
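The message printed after each import suggests `show_col_types = FALSE` as a way to silence the column specification. A minimal sketch of that option follows. It assumes readr 2.0 or later (for literal data wrapped in `I()`) and uses a tiny inline CSV in place of the remote file; the name `colony_demo` is illustrative only, and the two rows are copied from the colony_small printout in this report.

```r
library(readr)

# Inline CSV text standing in for the remote colony.csv (assumption:
# only three of the ten columns, two rows copied from this report)
csv_text <- I("state,colony_n,colony_lost_pct\nOklahoma,5000,7\nMissouri,7000,3\n")

# show_col_types = FALSE suppresses the column specification message
colony_demo <- read_csv(csv_text, show_col_types = FALSE)

colony_demo
```

The same argument can be added to the two `readr::read_csv()` calls on the real URLs.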

2. Make data small

Describe the two datasets:

Data 1: Colony

Columns: state, colony_n, colony_lost_pct
Rows: 10

Data 2: Stressor

Columns: state, stressor, stress_pct
Rows: 10

library(dplyr)  # provides %>%, select(), sample_n(), and the join functions used below

colony_small <- colony %>% select(state, colony_n, colony_lost_pct) %>% sample_n(10)
stressor_small <- stressor %>% select(state, stressor, stress_pct) %>% sample_n(10)

colony_small

## # A tibble: 10 × 3
##    state       colony_n colony_lost_pct
##    <chr>          <dbl>           <dbl>
##  1 Oklahoma        5000               7
##  2 Missouri        7000               3
##  3 Kentucky        8500              13
##  4 Maryland        7500               9
##  5 Oklahoma        3700              NA
##  6 Minnesota      39000               6
##  7 Louisiana      55000               7
##  8 Ohio           15500              25
##  9 Mississippi    26000               6
## 10 Wyoming        27000              12

stressor_small

## # A tibble: 10 × 3
##    state          stressor              stress_pct
##    <chr>          <chr>                      <dbl>
##  1 Illinois       Varroa mites                21.2
##  2 Oklahoma       Other pests/parasites       10.9
##  3 West Virginia  Varroa mites                23.2
##  4 North Carolina Pesticides                   2  
##  5 New Jersey     Unknown                      2.3
##  6 California     Other pests/parasites       15.8
##  7 California     Pesticides                  11.6
##  8 Georgia        Pesticides                   2.6
##  9 Mississippi    Unknown                      5.2
## 10 Ohio           Pesticides                   1.8

Note: the row counts in the sections below can change from one run to the next, because sample_n() draws a different random 10 rows every time the code runs. The sample I got when I first ran the code is different from the one drawn when I knit the whole document.
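One way to keep the sample stable between an interactive run and a knit is to fix the random seed before sampling. This is a minimal sketch, not part of the original analysis: the data frame `toy` and the seed value are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the colony data (values are made up)
toy <- tibble(
  state    = state.name[1:20],
  colony_n = seq(1000, 20000, by = 1000)
)

set.seed(1234)                     # hypothetical seed; any fixed integer works
sample_a <- toy %>% sample_n(10)

set.seed(1234)                     # resetting the seed reproduces the same draw
sample_b <- toy %>% sample_n(10)

identical(sample_a, sample_b)      # both draws contain the same 10 rows
```

In the report itself, a single `set.seed()` call placed before the two `sample_n()` lines should be enough, as long as the code is always run from the top.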

3. inner_join

Describe the resulting data:

Columns: state, colony_n, colony_lost_pct, stressor, stress_pct
Rows: 4

How is it different from the original two datasets?

4 rows compared to 10 rows in each original dataset, because only rows whose state appears in both datasets are kept
all columns from the two datasets

colony_small %>% inner_join(stressor_small)

## Joining with `by = join_by(state)`

## # A tibble: 4 × 5
##   state       colony_n colony_lost_pct stressor              stress_pct
##   <chr>          <dbl>           <dbl> <chr>                      <dbl>
## 1 Oklahoma        5000               7 Other pests/parasites       10.9
## 2 Oklahoma        3700              NA Other pests/parasites       10.9
## 3 Ohio           15500              25 Pesticides                   1.8
## 4 Mississippi    26000               6 Unknown                      5.2
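The join above picks its key automatically, as the `Joining with` message shows. Passing `by` explicitly makes the key visible in the code and silences that message. A small sketch, assuming toy tables built from a few rows of the samples printed in this report (the names `colony_toy` and `stressor_toy` are illustrative):

```r
library(dplyr)

# A few rows copied from the printed samples
colony_toy <- tibble(
  state           = c("Oklahoma", "Oklahoma", "Ohio", "Wyoming"),
  colony_n        = c(5000, 3700, 15500, 27000),
  colony_lost_pct = c(7, NA, 25, 12)
)
stressor_toy <- tibble(
  state      = c("Oklahoma", "Ohio", "California"),
  stressor   = c("Other pests/parasites", "Pesticides", "Pesticides"),
  stress_pct = c(10.9, 1.8, 11.6)
)

# Naming the key explicitly avoids the "Joining with `by = ...`" message
joined <- colony_toy %>% inner_join(stressor_toy, by = "state")

joined
```

Wyoming and California drop out because each appears in only one table, while both Oklahoma rows are kept because each finds a match.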

4. left_join

Describe the resulting data:

Columns: state, colony_n, colony_lost_pct, stressor, stress_pct
Rows: 10

How is it different from the original two datasets?

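10 rows, the same as Data 1: every row of colony_small is kept, and stressor and stress_pct are NA where the state has no match in stressor_small
all columns from the two datasets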

colony_small %>% left_join(stressor_small)

## Joining with `by = join_by(state)`

## # A tibble: 10 × 5
##    state       colony_n colony_lost_pct stressor              stress_pct
##    <chr>          <dbl>           <dbl> <chr>                      <dbl>
##  1 Oklahoma        5000               7 Other pests/parasites       10.9
##  2 Missouri        7000               3 <NA>                        NA  
##  3 Kentucky        8500              13 <NA>                        NA  
##  4 Maryland        7500               9 <NA>                        NA  
##  5 Oklahoma        3700              NA Other pests/parasites       10.9
##  6 Minnesota      39000               6 <NA>                        NA  
##  7 Louisiana      55000               7 <NA>                        NA  
##  8 Ohio           15500              25 Pesticides                   1.8
##  9 Mississippi    26000               6 Unknown                      5.2
## 10 Wyoming        27000              12 <NA>                        NA
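The `<NA>` entries in the output mark colony rows that found no partner. Counting them is a quick check on how much of the left table matched. A sketch under the assumption that the right-hand table has no missing values of its own; the toy tables reuse a few rows from this report.

```r
library(dplyr)

colony_toy <- tibble(
  state    = c("Oklahoma", "Missouri", "Kentucky"),
  colony_n = c(5000, 7000, 8500)
)
stressor_toy <- tibble(
  state      = "Oklahoma",
  stressor   = "Other pests/parasites",
  stress_pct = 10.9
)

left <- colony_toy %>% left_join(stressor_toy, by = "state")

# Rows with NA in a right-hand column had no match
# (valid here because stressor_toy itself contains no NA)
n_unmatched <- sum(is.na(left$stressor))

n_unmatched
```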

5. right_join

Describe the resulting data:

Columns: state, colony_n, colony_lost_pct, stressor, stress_pct
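Rows: 11

How is it different from the original two datasets?

11 rows compared to 10 rows in the original dataset: every row of stressor_small is kept, and Oklahoma matches two rows of colony_small, so it appears twice
all columns from the two datasets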

colony_small %>% right_join(stressor_small)

## Joining with `by = join_by(state)`

## # A tibble: 11 × 5
##    state          colony_n colony_lost_pct stressor              stress_pct
##    <chr>             <dbl>           <dbl> <chr>                      <dbl>
##  1 Oklahoma           5000               7 Other pests/parasites       10.9
##  2 Oklahoma           3700              NA Other pests/parasites       10.9
##  3 Ohio              15500              25 Pesticides                   1.8
##  4 Mississippi       26000               6 Unknown                      5.2
##  5 Illinois             NA              NA Varroa mites                21.2
##  6 West Virginia        NA              NA Varroa mites                23.2
##  7 North Carolina       NA              NA Pesticides                   2  
##  8 New Jersey           NA              NA Unknown                      2.3
##  9 California           NA              NA Other pests/parasites       15.8
## 10 California           NA              NA Pesticides                  11.6
## 11 Georgia              NA              NA Pesticides                   2.6
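A right join can return more rows than the right-hand table when a key is repeated on the left, which is what happens with Oklahoma in the output above. The sketch below shows one way to spot such keys with `count()`. The toy tables are illustrative and reuse values from this report.

```r
library(dplyr)

colony_toy <- tibble(
  state    = c("Oklahoma", "Oklahoma", "Ohio"),
  colony_n = c(5000, 3700, 15500)
)
stressor_toy <- tibble(
  state      = c("Oklahoma", "Ohio", "Georgia"),
  stress_pct = c(10.9, 1.8, 2.6)
)

# States that occur more than once on the left side
dup_states <- colony_toy %>% count(state) %>% filter(n > 1)

right <- colony_toy %>% right_join(stressor_toy, by = "state")

nrow(stressor_toy)  # rows on the right side
nrow(right)         # one more, because Oklahoma matched two colony rows
```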

6. full_join

Describe the resulting data:

Columns: state, colony_n, colony_lost_pct, stressor, stress_pct
Rows: 17

How is it different from the original two datasets?

17 rows compared to 10 rows in the original dataset: the 10 rows of colony_small plus the 7 rows of stressor_small whose state has no match
all columns from the two datasets

colony_small %>% full_join(stressor_small)

## Joining with `by = join_by(state)`

## # A tibble: 17 × 5
##    state          colony_n colony_lost_pct stressor              stress_pct
##    <chr>             <dbl>           <dbl> <chr>                      <dbl>
##  1 Oklahoma           5000               7 Other pests/parasites       10.9
##  2 Missouri           7000               3 <NA>                        NA  
##  3 Kentucky           8500              13 <NA>                        NA  
##  4 Maryland           7500               9 <NA>                        NA  
##  5 Oklahoma           3700              NA Other pests/parasites       10.9
##  6 Minnesota         39000               6 <NA>                        NA  
##  7 Louisiana         55000               7 <NA>                        NA  
##  8 Ohio              15500              25 Pesticides                   1.8
##  9 Mississippi       26000               6 Unknown                      5.2
## 10 Wyoming           27000              12 <NA>                        NA  
## 11 Illinois             NA              NA Varroa mites                21.2
## 12 West Virginia        NA              NA Varroa mites                23.2
## 13 North Carolina       NA              NA Pesticides                   2  
## 14 New Jersey           NA              NA Unknown                      2.3
## 15 California           NA              NA Other pests/parasites       15.8
## 16 California           NA              NA Pesticides                  11.6
## 17 Georgia              NA              NA Pesticides                   2.6
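The row count of a full join can be checked by hand: it is the row count of the left join plus the rows of the right table that match nothing. The sketch below rebuilds the state columns of both printed samples (with one numeric column each, copied from the output) and confirms the arithmetic. The object names are illustrative.

```r
library(dplyr)

# State and one value column from each printed sample
colony_toy <- tibble(
  state = c("Oklahoma", "Missouri", "Kentucky", "Maryland", "Oklahoma",
            "Minnesota", "Louisiana", "Ohio", "Mississippi", "Wyoming"),
  colony_n = c(5000, 7000, 8500, 7500, 3700, 39000, 55000, 15500, 26000, 27000)
)
stressor_toy <- tibble(
  state = c("Illinois", "Oklahoma", "West Virginia", "North Carolina",
            "New Jersey", "California", "California", "Georgia",
            "Mississippi", "Ohio"),
  stress_pct = c(21.2, 10.9, 23.2, 2, 2.3, 15.8, 11.6, 2.6, 5.2, 1.8)
)

full       <- colony_toy %>% full_join(stressor_toy, by = "state")
left       <- colony_toy %>% left_join(stressor_toy, by = "state")
right_only <- stressor_toy %>% anti_join(colony_toy, by = "state")

# Full join rows = left join rows + unmatched right rows
nrow(full) == nrow(left) + nrow(right_only)
```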

7. semi_join

Describe the resulting data:

Columns: state, colony_n, colony_lost_pct
Rows: 4

How is it different from the original two datasets?

4 rows compared to 10 rows in the original dataset: only the colony_small rows whose state also appears in stressor_small
not all columns from the two datasets, only state, colony_n, colony_lost_pct (so Data 1)

colony_small %>% semi_join(stressor_small)

## Joining with `by = join_by(state)`

## # A tibble: 4 × 3
##   state       colony_n colony_lost_pct
##   <chr>          <dbl>           <dbl>
## 1 Oklahoma        5000               7
## 2 Oklahoma        3700              NA
## 3 Ohio           15500              25
## 4 Mississippi    26000               6

8. anti_join

Describe the resulting data:

Columns: state, colony_n, colony_lost_pct
Rows: 6

How is it different from the original two datasets?

6 rows compared to 10 rows in the original dataset: only the colony_small rows whose state does not appear in stressor_small
not all columns from the two datasets, only state, colony_n, colony_lost_pct (so Data 1)

colony_small %>% anti_join(stressor_small)

## Joining with `by = join_by(state)`

## # A tibble: 6 × 3
##   state     colony_n colony_lost_pct
##   <chr>        <dbl>           <dbl>
## 1 Missouri      7000               3
## 2 Kentucky      8500              13
## 3 Maryland      7500               9
## 4 Minnesota    39000               6
## 5 Louisiana    55000               7
## 6 Wyoming      27000              12
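semi_join and anti_join split the left table into two non-overlapping parts, so their row counts add up to the original (4 + 6 = 10 in the output above). Unlike inner_join, semi_join never duplicates a left row, even when the right table repeats a key. A small sketch of both points follows. The toy tables are illustrative, and the second Ohio stressor row is a made-up pairing added only to create a repeated key.

```r
library(dplyr)

colony_toy <- tibble(
  state    = c("Oklahoma", "Oklahoma", "Ohio", "Wyoming"),
  colony_n = c(5000, 3700, 15500, 27000)
)
stressor_toy <- tibble(
  state    = c("Oklahoma", "Ohio", "Ohio"),   # Ohio repeated on purpose
  stressor = c("Other pests/parasites", "Pesticides", "Unknown")
)

kept    <- colony_toy %>% semi_join(stressor_toy, by = "state")
dropped <- colony_toy %>% anti_join(stressor_toy, by = "state")
inner   <- colony_toy %>% inner_join(stressor_toy, by = "state")

# The two filtering joins partition the left table
nrow(kept) + nrow(dropped) == nrow(colony_toy)

# inner_join repeats the Ohio row once per matching stressor; semi_join does not
c(semi = nrow(kept), inner = nrow(inner))
```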

Laurence Bouchard

2026-03-14
