Import your data

bee_colonies <- read_excel("../00_data/MyData3.xlsx")
stressors <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv')
## Rows: 7332 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): months, state, stressor
## dbl (2): year, stress_pct
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Make data small

set.seed(123)
bee_colonies_smaller2 <- bee_colonies %>%
    select(year, months, state, colony_size, colony_lost) %>%
    sample_n(10)

set.seed(123)
stressors_small <- stressors %>%
    select(year, months, state, stressor) %>%
    sample_n(10)

bee_colonies_smaller2
## # A tibble: 10 × 5
##     year months           state      colony_size colony_lost
##    <dbl> <chr>            <chr>            <dbl>       <dbl>
##  1  2017 January-March    Utah             16000        2700
##  2  2017 April-June       Vermont           6000         170
##  3  2015 October-December Texas           125000       25000
##  4  2017 October-December Hawaii           15000         130
##  5  2016 January-March    Florida         245000       45000
##  6  2019 October-December Wyoming          27000        3300
##  7  2021 January-March    Kansas            5000        1400
##  8  2020 July-September   California      640000       69000
##  9  2018 July-September   Florida         197000       30000
## 10  2018 January-March    Texas           205000       22000
stressors_small
## # A tibble: 10 × 4
##     year months           state          stressor             
##    <dbl> <chr>            <chr>          <chr>                
##  1  2017 January-March    South Carolina Disesases            
##  2  2017 January-March    West Virginia  Disesases            
##  3  2016 October-December West Virginia  Varroa mites         
##  4  2015 April-June       Virginia       Pesticides           
##  5  2018 October-December Illinois       Varroa mites         
##  6  2017 July-September   New York       Pesticides           
##  7  2016 July-September   Nebraska       Unknown              
##  8  2016 January-March    Arkansas       Other pests/parasites
##  9  2017 October-December Wyoming        Other                
## 10  2019 July-September   Other States   Disesases

Inner join

The resulting varibles from “bee_colonies” and “stressors”, has common variables. The variables from the dataset “stressors” is “year”, “months” and “state”, while “year”, “months”, and “state is from dataset bee_colonies. I chose the variable”state”. The resulting dataset has one row, and 8 columns.

bee_colonies_smaller2 %>% inner_join(stressors_small)
## Joining with `by = join_by(year, months, state)`
## # A tibble: 0 × 6
## # ℹ 6 variables: year <dbl>, months <chr>, state <chr>, colony_size <dbl>,
## #   colony_lost <dbl>, stressor <chr>
stressors_small %>% inner_join(bee_colonies_smaller2)
## Joining with `by = join_by(year, months, state)`
## # A tibble: 0 × 6
## # ℹ 6 variables: year <dbl>, months <chr>, state <chr>, stressor <chr>,
## #   colony_size <dbl>, colony_lost <dbl>

Left join

The resulting dataset has one row and one column. This data does not have the best related variables to perform the function.

bee_colonies_smaller2 %>% left_join(stressors_small)
## Joining with `by = join_by(year, months, state)`
## # A tibble: 10 × 6
##     year months           state      colony_size colony_lost stressor
##    <dbl> <chr>            <chr>            <dbl>       <dbl> <chr>   
##  1  2017 January-March    Utah             16000        2700 <NA>    
##  2  2017 April-June       Vermont           6000         170 <NA>    
##  3  2015 October-December Texas           125000       25000 <NA>    
##  4  2017 October-December Hawaii           15000         130 <NA>    
##  5  2016 January-March    Florida         245000       45000 <NA>    
##  6  2019 October-December Wyoming          27000        3300 <NA>    
##  7  2021 January-March    Kansas            5000        1400 <NA>    
##  8  2020 July-September   California      640000       69000 <NA>    
##  9  2018 July-September   Florida         197000       30000 <NA>    
## 10  2018 January-March    Texas           205000       22000 <NA>

Right join

The resulting dataset has 10 rows and 8 columns. This data does not have the best related variables to perform the function.

bee_colonies_smaller2 %>% right_join(stressors_small)
## Joining with `by = join_by(year, months, state)`
## # A tibble: 10 × 6
##     year months           state          colony_size colony_lost stressor       
##    <dbl> <chr>            <chr>                <dbl>       <dbl> <chr>          
##  1  2017 January-March    South Carolina          NA          NA Disesases      
##  2  2017 January-March    West Virginia           NA          NA Disesases      
##  3  2016 October-December West Virginia           NA          NA Varroa mites   
##  4  2015 April-June       Virginia                NA          NA Pesticides     
##  5  2018 October-December Illinois                NA          NA Varroa mites   
##  6  2017 July-September   New York                NA          NA Pesticides     
##  7  2016 July-September   Nebraska                NA          NA Unknown        
##  8  2016 January-March    Arkansas                NA          NA Other pests/pa…
##  9  2017 October-December Wyoming                 NA          NA Other          
## 10  2019 July-September   Other States            NA          NA Disesases

Full join

bee_colonies_smaller2 %>% full_join(stressors_small)
## Joining with `by = join_by(year, months, state)`
## # A tibble: 20 × 6
##     year months           state          colony_size colony_lost stressor       
##    <dbl> <chr>            <chr>                <dbl>       <dbl> <chr>          
##  1  2017 January-March    Utah                 16000        2700 <NA>           
##  2  2017 April-June       Vermont               6000         170 <NA>           
##  3  2015 October-December Texas               125000       25000 <NA>           
##  4  2017 October-December Hawaii               15000         130 <NA>           
##  5  2016 January-March    Florida             245000       45000 <NA>           
##  6  2019 October-December Wyoming              27000        3300 <NA>           
##  7  2021 January-March    Kansas                5000        1400 <NA>           
##  8  2020 July-September   California          640000       69000 <NA>           
##  9  2018 July-September   Florida             197000       30000 <NA>           
## 10  2018 January-March    Texas               205000       22000 <NA>           
## 11  2017 January-March    South Carolina          NA          NA Disesases      
## 12  2017 January-March    West Virginia           NA          NA Disesases      
## 13  2016 October-December West Virginia           NA          NA Varroa mites   
## 14  2015 April-June       Virginia                NA          NA Pesticides     
## 15  2018 October-December Illinois                NA          NA Varroa mites   
## 16  2017 July-September   New York                NA          NA Pesticides     
## 17  2016 July-September   Nebraska                NA          NA Unknown        
## 18  2016 January-March    Arkansas                NA          NA Other pests/pa…
## 19  2017 October-December Wyoming                 NA          NA Other          
## 20  2019 July-September   Other States            NA          NA Disesases

Semi join

stressors_small %>% semi_join(bee_colonies_smaller2)
## Joining with `by = join_by(year, months, state)`
## # A tibble: 0 × 4
## # ℹ 4 variables: year <dbl>, months <chr>, state <chr>, stressor <chr>

Anti join - filtering

bee_colonies_smaller2 %>% anti_join(stressors_small)
## Joining with `by = join_by(year, months, state)`
## # A tibble: 10 × 5
##     year months           state      colony_size colony_lost
##    <dbl> <chr>            <chr>            <dbl>       <dbl>
##  1  2017 January-March    Utah             16000        2700
##  2  2017 April-June       Vermont           6000         170
##  3  2015 October-December Texas           125000       25000
##  4  2017 October-December Hawaii           15000         130
##  5  2016 January-March    Florida         245000       45000
##  6  2019 October-December Wyoming          27000        3300
##  7  2021 January-March    Kansas            5000        1400
##  8  2020 July-September   California      640000       69000
##  9  2018 July-September   Florida         197000       30000
## 10  2018 January-March    Texas           205000       22000

Chapter 13

What are primary keys in your data?

The primary keys in my data bee_colonies, is a combination of the variables year, months, state and colony_max, when using count() function to count n = 1 for each row and observation.