Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

forest <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2021/2021-04-06/forest.csv')

## Rows: 475 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): entity, code
## dbl (2): year, net_forest_conversion
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

forest_area <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2021/2021-04-06/forest_area.csv')

## Rows: 7846 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): entity, code
## dbl (2): year, forest_area
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1: Forest

Columns: entity, year, net_forest_conversion
Rows: 10 (sampled from the 475 rows of forest)

Data 2: Forest Area

Columns: entity, year, forest_area
Rows: 10 (sampled from the 7846 rows of forest_area)

library(dplyr) # provides %>%, select(), and sample_n()
set.seed(1234) # make the random samples reproducible
forest_small <- forest %>% select(entity, year, net_forest_conversion) %>% sample_n(10)
forest_area_small <- forest_area %>% select(entity, year, forest_area) %>% sample_n(10)

forest_small

## # A tibble: 10 × 3
##    entity            year net_forest_conversion
##    <chr>            <dbl>                 <dbl>
##  1 Morocco           2000                 16800
##  2 Papua New Guinea  2000                 -9910
##  3 Switzerland       2000                  3850
##  4 Cuba              1990                 37700
##  5 Djibouti          2010                     0
##  6 Spain             2000                145140
##  7 Falkland Islands  1990                     0
##  8 Togo              2010                 -2960
##  9 Suriname          2015                -11080
## 10 South Africa      2015                -36400

forest_area_small

## # A tibble: 10 × 3
##    entity               year forest_area
##    <chr>               <dbl>       <dbl>
##  1 El Salvador          2011  0.0152    
##  2 Indonesia            1996  2.58      
##  3 Greenland            1998  0.00000527
##  4 Romania              2011  0.161     
##  5 Faroe Islands        1999  0.00000192
##  6 Burundi              2015  0.00685   
##  7 Malawi               2017  0.0581    
##  8 Trinidad and Tobago  2002  0.00569   
##  9 Micronesia           1997  0.00153   
## 10 Jordan               1996  0.00233

3. inner_join

forest_small %>% inner_join(forest_area_small, by = c("entity", "year"))

## # A tibble: 0 × 4
## # ℹ 4 variables: entity <chr>, year <dbl>, net_forest_conversion <dbl>,
## #   forest_area <dbl>

Describe the resulting data:

Columns: entity, year, net_forest_conversion, forest_area
Rows: 0

How is it different from the original two datasets?

An inner join keeps only the rows whose entity and year appear in both datasets, and it returns the columns of both. The two random samples share no entity-year pair, so the result has 0 rows.
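
Since the sampled data produce no matches, a minimal sketch on hypothetical toy tables may make the matching rule easier to see. The tables conv_toy and area_toy and their values are invented here for illustration; they are not part of the TidyTuesday data.

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
conv_toy <- tibble(
  entity = c("A", "B", "C"),
  year = c(2000, 2000, 2010),
  net_forest_conversion = c(100, -50, 0)
)
area_toy <- tibble(
  entity = c("A", "B", "D"),
  year = c(2000, 2010, 2010),
  forest_area = c(0.5, 0.2, 0.1)
)

# Only the pair ("A", 2000) appears in both tables.
# "B" is in both, but with different years, so it does not match.
inner_toy <- inner_join(conv_toy, area_toy, by = c("entity", "year"))
inner_toy
```

Both key columns must agree for a row to be kept, which is why a shared entity alone is not enough.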

4. left_join

forest_small %>% left_join(forest_area_small, by = c("entity", "year"))

## # A tibble: 10 × 4
##    entity            year net_forest_conversion forest_area
##    <chr>            <dbl>                 <dbl>       <dbl>
##  1 Morocco           2000                 16800          NA
##  2 Papua New Guinea  2000                 -9910          NA
##  3 Switzerland       2000                  3850          NA
##  4 Cuba              1990                 37700          NA
##  5 Djibouti          2010                     0          NA
##  6 Spain             2000                145140          NA
##  7 Falkland Islands  1990                     0          NA
##  8 Togo              2010                 -2960          NA
##  9 Suriname          2015                -11080          NA
## 10 South Africa      2015                -36400          NA

Describe the resulting data:

Columns: entity, year, net_forest_conversion, forest_area
Rows: 10

How is it different from the original two datasets?

Keeps all rows from the first dataset (forest_small) and adds matching values from the second dataset (forest_area_small). Rows without a match are filled with NA. In this case, the forest_area column exists but contains only NA values because there are no matching rows.
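
As a small sketch of the same behavior when some rows do match, assuming hypothetical toy tables (conv_toy and area_toy are made up here and are not the TidyTuesday data):

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
conv_toy <- tibble(
  entity = c("A", "B", "C"),
  year = c(2000, 2000, 2010),
  net_forest_conversion = c(100, -50, 0)
)
area_toy <- tibble(
  entity = c("A", "B", "D"),
  year = c(2000, 2010, 2010),
  forest_area = c(0.5, 0.2, 0.1)
)

# Every row of conv_toy is kept; forest_area is filled in only
# where both entity and year match a row of area_toy
left_toy <- left_join(conv_toy, area_toy, by = c("entity", "year"))
left_toy

# Number of conv_toy rows that found no partner in area_toy
n_unmatched <- sum(is.na(left_toy$forest_area))
n_unmatched
```

Counting the NA values in the added column is a quick way to see how many rows of the first table went unmatched.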

5. right_join

forest_small %>% 
  right_join(forest_area_small, by = c("entity", "year"))

## # A tibble: 10 × 4
##    entity               year net_forest_conversion forest_area
##    <chr>               <dbl>                 <dbl>       <dbl>
##  1 El Salvador          2011                    NA  0.0152    
##  2 Indonesia            1996                    NA  2.58      
##  3 Greenland            1998                    NA  0.00000527
##  4 Romania              2011                    NA  0.161     
##  5 Faroe Islands        1999                    NA  0.00000192
##  6 Burundi              2015                    NA  0.00685   
##  7 Malawi               2017                    NA  0.0581    
##  8 Trinidad and Tobago  2002                    NA  0.00569   
##  9 Micronesia           1997                    NA  0.00153   
## 10 Jordan               1996                    NA  0.00233

Describe the resulting data:

Columns: entity, year, net_forest_conversion, forest_area
Rows: 10

How is it different from the original two datasets?

Keeps all rows from the second dataset (forest_area_small) and adds matching values from the first dataset (forest_small). In this case, there are no matches, so the net_forest_conversion column contains only NA values.
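
One way to think about right_join is as a left_join with the two tables swapped. The sketch below checks this on hypothetical toy tables (conv_toy and area_toy are invented for illustration); the comparison aligns column and row order first, since the two calls may order them differently.

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
conv_toy <- tibble(
  entity = c("A", "B", "C"),
  year = c(2000, 2000, 2010),
  net_forest_conversion = c(100, -50, 0)
)
area_toy <- tibble(
  entity = c("A", "B", "D"),
  year = c(2000, 2010, 2010),
  forest_area = c(0.5, 0.2, 0.1)
)

# Keep every row of area_toy, pulling in conv_toy values where they match
right_toy <- right_join(conv_toy, area_toy, by = c("entity", "year"))

# The same join written as a left_join with the tables swapped
swapped_toy <- left_join(area_toy, conv_toy, by = c("entity", "year"))

# Align column order and row order before comparing the contents
same_rows <- isTRUE(all.equal(
  as.data.frame(arrange(right_toy, entity, year)),
  as.data.frame(arrange(select(swapped_toy, all_of(names(right_toy))), entity, year))
))
same_rows
```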

6. full_join

full_join(forest_small, forest_area_small, by = c("entity", "year"))

## # A tibble: 20 × 4
##    entity               year net_forest_conversion forest_area
##    <chr>               <dbl>                 <dbl>       <dbl>
##  1 Morocco              2000                 16800 NA         
##  2 Papua New Guinea     2000                 -9910 NA         
##  3 Switzerland          2000                  3850 NA         
##  4 Cuba                 1990                 37700 NA         
##  5 Djibouti             2010                     0 NA         
##  6 Spain                2000                145140 NA         
##  7 Falkland Islands     1990                     0 NA         
##  8 Togo                 2010                 -2960 NA         
##  9 Suriname             2015                -11080 NA         
## 10 South Africa         2015                -36400 NA         
## 11 El Salvador          2011                    NA  0.0152    
## 12 Indonesia            1996                    NA  2.58      
## 13 Greenland            1998                    NA  0.00000527
## 14 Romania              2011                    NA  0.161     
## 15 Faroe Islands        1999                    NA  0.00000192
## 16 Burundi              2015                    NA  0.00685   
## 17 Malawi               2017                    NA  0.0581    
## 18 Trinidad and Tobago  2002                    NA  0.00569   
## 19 Micronesia           1997                    NA  0.00153   
## 20 Jordan               1996                    NA  0.00233

Describe the resulting data:

Columns: entity, year, net_forest_conversion, forest_area
Rows: 20

How is it different from the original two datasets?

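Combines all rows from both datasets. Unmatched rows from either dataset are included, with NA in the columns that come from the other dataset. In this case no rows match, so the result has 10 + 10 = 20 rows.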
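
The row count of a full join can be broken down as matched rows plus the unmatched rows from each side. The sketch below checks that arithmetic on hypothetical toy tables (conv_toy and area_toy are invented here, and the key pairs are unique within each table).

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
conv_toy <- tibble(
  entity = c("A", "B", "C"),
  year = c(2000, 2000, 2010),
  net_forest_conversion = c(100, -50, 0)
)
area_toy <- tibble(
  entity = c("A", "B", "D"),
  year = c(2000, 2010, 2010),
  forest_area = c(0.5, 0.2, 0.1)
)
keys <- c("entity", "year")

full_toy <- full_join(conv_toy, area_toy, by = keys)

# Break the full join down into its three parts
n_matched <- nrow(inner_join(conv_toy, area_toy, by = keys))
n_only_conv <- nrow(anti_join(conv_toy, area_toy, by = keys))
n_only_area <- nrow(anti_join(area_toy, conv_toy, by = keys))

# With unique keys, the three parts add up to the full join
nrow(full_toy) == n_matched + n_only_conv + n_only_area
```

For the sampled data above, the same breakdown is 0 matched + 10 + 10 = 20.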

7. semi_join

semi_join(forest_small, forest_area_small, by = c("entity", "year"))

## # A tibble: 0 × 3
## # ℹ 3 variables: entity <chr>, year <dbl>, net_forest_conversion <dbl>

Describe the resulting data:

Columns: entity, year, net_forest_conversion
Rows: 0

How is it different from the original two datasets?

Returns only the rows from the first dataset that have a match in the second dataset, without adding any columns from the second dataset. Here there are no matches, so the result has 0 rows.
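
A semi join differs from an inner join in two ways: it adds no columns, and it never duplicates rows of the first table. The sketch below illustrates both on hypothetical toy tables (x_toy and y_toy are invented, and the repeated key in y_toy is deliberate). Depending on the dplyr version, the inner_join call may emit a warning about multiple matches.

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only.
# y_toy deliberately contains the key ("A", 2000) twice.
x_toy <- tibble(
  entity = c("A", "B"),
  year = c(2000, 2000),
  net_forest_conversion = c(100, -50)
)
y_toy <- tibble(
  entity = c("A", "A"),
  year = c(2000, 2000),
  forest_area = c(0.5, 0.6)
)
keys <- c("entity", "year")

# inner_join returns one row per matching pair, so "A" appears twice
inner_toy <- inner_join(x_toy, y_toy, by = keys)

# semi_join only filters x_toy: "A" appears once, with x_toy's columns only
semi_toy <- semi_join(x_toy, y_toy, by = keys)

nrow(inner_toy)
nrow(semi_toy)
names(semi_toy)
```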

8. anti_join

anti_join(forest_small, forest_area_small, by = c("entity", "year"))

## # A tibble: 10 × 3
##    entity            year net_forest_conversion
##    <chr>            <dbl>                 <dbl>
##  1 Morocco           2000                 16800
##  2 Papua New Guinea  2000                 -9910
##  3 Switzerland       2000                  3850
##  4 Cuba              1990                 37700
##  5 Djibouti          2010                     0
##  6 Spain             2000                145140
##  7 Falkland Islands  1990                     0
##  8 Togo              2010                 -2960
##  9 Suriname          2015                -11080
## 10 South Africa      2015                -36400

Describe the resulting data:

Columns: entity, year, net_forest_conversion
Rows: 10 (every row of forest_small)

How is it different from the original two datasets?

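Returns the rows from the first dataset that do not have a match in the second dataset, without adding any columns. Because nothing matches here, all 10 rows of forest_small are returned.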
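
semi_join and anti_join are complements: between them they split the first table into matched and unmatched rows. A minimal sketch on hypothetical toy tables (conv_toy and area_toy are invented for illustration):

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
conv_toy <- tibble(
  entity = c("A", "B", "C"),
  year = c(2000, 2000, 2010),
  net_forest_conversion = c(100, -50, 0)
)
area_toy <- tibble(
  entity = c("A", "B", "D"),
  year = c(2000, 2010, 2010),
  forest_area = c(0.5, 0.2, 0.1)
)
keys <- c("entity", "year")

# Rows of conv_toy with a match in area_toy
matched_toy <- semi_join(conv_toy, area_toy, by = keys)

# Rows of conv_toy without a match in area_toy
unmatched_toy <- anti_join(conv_toy, area_toy, by = keys)

# Every row of conv_toy lands in exactly one of the two results
nrow(matched_toy) + nrow(unmatched_toy) == nrow(conv_toy)
```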

Jennifer Chorvatovic

2026-03-26
