Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

library(tidyverse) # provides read_csv() and the dplyr join functions

# csv file
freedom <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-22/freedom.csv')

## Rows: 4979 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): country, Status, Region_Name
## dbl (5): year, CL, PR, Region_Code, is_ldc
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ratio <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-07/student_teacher_ratio.csv')

## Rows: 5189 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): edulit_ind, indicator, country_code, country, flag_codes, flags
## dbl (2): year, student_ratio
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Make data small

Describe the two datasets:

Data 1

Columns: country, year, Region_Name
Rows: 50

Data 2

Columns: country, year, student_ratio
Rows: 50

set.seed(1234)
freedom_small <- freedom %>% select(country, year, Region_Name) %>% sample_n(50)
ratio_small <- ratio %>% select(country, year, student_ratio) %>% sample_n(50)

freedom_small

## # A tibble: 50 × 3
##    country     year Region_Name
##    <chr>      <dbl> <chr>      
##  1 Congo       2010 Africa     
##  2 Brazil      2019 Americas   
##  3 Mali        2009 Africa     
##  4 China       2018 Asia       
##  5 Tonga       2005 Oceania    
##  6 Montenegro  2015 Europe     
##  7 Jamaica     2008 Americas   
##  8 Nicaragua   2008 Americas   
##  9 Mauritania  2012 Africa     
## 10 Latvia      2002 Europe     
## # ℹ 40 more rows

ratio_small

## # A tibble: 50 × 3
##    country                         year student_ratio
##    <chr>                          <dbl>         <dbl>
##  1 Singapore                       2016          15.1
##  2 Russian Federation              2015          20.1
##  3 Small Island Developing States  2016          21.6
##  4 Small Island Developing States  2014          23.0
##  5 Bulgaria                        2016          11.8
##  6 Tonga                           2015          10.5
##  7 Honduras                        2015          29.1
##  8 South Africa                    2015          30.3
##  9 Asia (Central)                  2013          17.4
## 10 Solomon Islands                 2013          33.8
## # ℹ 40 more rows

3. inner_join

Describe the resulting data:

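Columns: country, year, student_ratio, Region_Name
Rows: 0

How is it different from the original two datasets? The result has 4 columns (the two key columns plus student_ratio and Region_Name) but 0 rows, because no country-year pair appears in both samples.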

ratio_small %>% inner_join(freedom_small, by = c("country", "year"))

## # A tibble: 0 × 4
## # ℹ 4 variables: country <chr>, year <dbl>, student_ratio <dbl>,
## #   Region_Name <chr>
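To see what inner_join returns when some keys do overlap, here is a minimal sketch. The tables `ratio_toy` and `freedom_toy` and all their values are made up for illustration; they are not drawn from the TidyTuesday data.

```r
library(dplyr)

# Hypothetical toy tables standing in for ratio_small and freedom_small.
ratio_toy <- tibble(
  country = c("A", "B", "C"),
  year = c(2015, 2016, 2017),
  student_ratio = c(10, 20, 30)
)
freedom_toy <- tibble(
  country = c("B", "C", "D"),
  year = c(2016, 2001, 2017),
  Region_Name = c("Europe", "Asia", "Africa")
)

# Only the pair ("B", 2016) appears in both tables, so one row survives.
# "C" is in both tables but with different years, so it does not match.
joined <- ratio_toy %>% inner_join(freedom_toy, by = c("country", "year"))
joined
```

Because the join is on both country and year, a country that appears in both tables still drops out if its years differ. That is likely why two independent random samples of 50 rows share no keys here.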

4. left_join

Describe the resulting data: Region_Name is unavailable (NA) in every row

Columns: 4
Rows: 50

How is it different from the original two datasets? It keeps all 50 rows of ratio_small and adds the Region_Name column from freedom_small. Because no row has a match in freedom_small, Region_Name is NA throughout.

ratio_small %>% left_join(freedom_small, by = c("country", "year"))

## # A tibble: 50 × 4
##    country                         year student_ratio Region_Name
##    <chr>                          <dbl>         <dbl> <chr>      
##  1 Singapore                       2016          15.1 <NA>       
##  2 Russian Federation              2015          20.1 <NA>       
##  3 Small Island Developing States  2016          21.6 <NA>       
##  4 Small Island Developing States  2014          23.0 <NA>       
##  5 Bulgaria                        2016          11.8 <NA>       
##  6 Tonga                           2015          10.5 <NA>       
##  7 Honduras                        2015          29.1 <NA>       
##  8 South Africa                    2015          30.3 <NA>       
##  9 Asia (Central)                  2013          17.4 <NA>       
## 10 Solomon Islands                 2013          33.8 <NA>       
## # ℹ 40 more rows
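A small sketch of how left_join fills unmatched rows with NA, and one way to list the rows that found no match. The toy tables and their values are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables; values are invented for illustration.
ratio_toy <- tibble(
  country = c("A", "B", "C"),
  year = c(2015, 2016, 2017),
  student_ratio = c(10, 20, 30)
)
freedom_toy <- tibble(
  country = c("B", "C", "D"),
  year = c(2016, 2001, 2017),
  Region_Name = c("Europe", "Asia", "Africa")
)

# All three rows of ratio_toy are kept; Region_Name is filled in
# only where a matching country-year pair exists in freedom_toy.
left <- ratio_toy %>% left_join(freedom_toy, by = c("country", "year"))

# Rows of ratio_toy that found no match in freedom_toy.
unmatched <- left %>% filter(is.na(Region_Name))
unmatched
```

Filtering on `is.na()` only identifies unmatched rows when the joined column has no missing values of its own, which holds for this toy data by construction.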

5. right_join

Describe the resulting data:

Columns: 4
Rows: 50

How is it different from the original two datasets? It keeps all 50 rows of freedom_small and adds the student_ratio column from ratio_small. Because no row has a match in ratio_small, student_ratio is NA throughout.

ratio_small %>% right_join(freedom_small, by = c("country", "year"))

## # A tibble: 50 × 4
##    country     year student_ratio Region_Name
##    <chr>      <dbl>         <dbl> <chr>      
##  1 Congo       2010            NA Africa     
##  2 Brazil      2019            NA Americas   
##  3 Mali        2009            NA Africa     
##  4 China       2018            NA Asia       
##  5 Tonga       2005            NA Oceania    
##  6 Montenegro  2015            NA Europe     
##  7 Jamaica     2008            NA Americas   
##  8 Nicaragua   2008            NA Americas   
##  9 Mauritania  2012            NA Africa     
## 10 Latvia      2002            NA Europe     
## # ℹ 40 more rows
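A right join can be read as a left join with the tables swapped. The sketch below checks that on hypothetical toy tables; the rows are sorted and the columns reordered first, since the two forms may differ in row and column order.

```r
library(dplyr)

# Hypothetical toy tables; values are invented for illustration.
ratio_toy <- tibble(
  country = c("A", "B", "C"),
  year = c(2015, 2016, 2017),
  student_ratio = c(10, 20, 30)
)
freedom_toy <- tibble(
  country = c("B", "C", "D"),
  year = c(2016, 2001, 2017),
  Region_Name = c("Europe", "Asia", "Africa")
)
keys <- c("country", "year")

# Keep every row of freedom_toy, written as a right join.
via_right <- ratio_toy %>%
  right_join(freedom_toy, by = keys) %>%
  arrange(country, year)

# The same rows, written as a left join with the tables swapped.
via_left <- freedom_toy %>%
  left_join(ratio_toy, by = keys) %>%
  select(country, year, student_ratio, Region_Name) %>%
  arrange(country, year)

# Compare the two results as plain data frames.
all.equal(as.data.frame(via_right), as.data.frame(via_left))
```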

6. full_join

Describe the resulting data:

Columns: 4
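Rows: 100

How is it different from the original two datasets? It keeps every row from both samples. Because no rows match, the result has 50 + 50 = 100 rows, and each row has NA in the column that comes from the other dataset.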

ratio_small %>% full_join(freedom_small, by = c("country", "year"))

## # A tibble: 100 × 4
##    country                         year student_ratio Region_Name
##    <chr>                          <dbl>         <dbl> <chr>      
##  1 Singapore                       2016          15.1 <NA>       
##  2 Russian Federation              2015          20.1 <NA>       
##  3 Small Island Developing States  2016          21.6 <NA>       
##  4 Small Island Developing States  2014          23.0 <NA>       
##  5 Bulgaria                        2016          11.8 <NA>       
##  6 Tonga                           2015          10.5 <NA>       
##  7 Honduras                        2015          29.1 <NA>       
##  8 South Africa                    2015          30.3 <NA>       
##  9 Asia (Central)                  2013          17.4 <NA>       
## 10 Solomon Islands                 2013          33.8 <NA>       
## # ℹ 90 more rows
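The row count of a full join can be worked out from the inputs. When each country-year pair is unique within its table, it is the rows of both tables minus the matched pairs, which here gives 50 + 50 - 0 = 100. A sketch on hypothetical toy tables, where one pair matches:

```r
library(dplyr)

# Hypothetical toy tables with unique country-year keys.
ratio_toy <- tibble(
  country = c("A", "B", "C"),
  year = c(2015, 2016, 2017),
  student_ratio = c(10, 20, 30)
)
freedom_toy <- tibble(
  country = c("B", "C", "D"),
  year = c(2016, 2001, 2017),
  Region_Name = c("Europe", "Asia", "Africa")
)
keys <- c("country", "year")

# Matched pairs are exactly the rows an inner join returns.
n_matched <- nrow(inner_join(ratio_toy, freedom_toy, by = keys))
n_full <- nrow(full_join(ratio_toy, freedom_toy, by = keys))

# 3 + 3 - 1 matched pair = 5 rows in the full join.
n_full == nrow(ratio_toy) + nrow(freedom_toy) - n_matched
```

This arithmetic assumes unique keys on both sides. If a key is duplicated in either table, matched rows multiply and the formula no longer applies.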

7. semi_join

Describe the resulting data:

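Columns: 3
Rows: 0

How is it different from the original two datasets? semi_join keeps only the rows of ratio_small that have a match in freedom_small, and it never adds columns. Since nothing matches, it returns 0 rows with ratio_small's 3 columns.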

ratio_small %>% semi_join(freedom_small, by = c("country", "year"))

## # A tibble: 0 × 3
## # ℹ 3 variables: country <chr>, year <dbl>, student_ratio <dbl>

8. anti_join

Describe the resulting data:

Columns: 3
Rows: 50

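How is it different from the original two datasets? anti_join keeps only the rows of ratio_small that have no match in freedom_small, and it never adds columns. Since nothing matches, it returns all 50 rows of ratio_small unchanged, so there is no Region_Name column and there are no NAs.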

ratio_small %>% anti_join(freedom_small, by = c("country", "year"))

## # A tibble: 50 × 3
##    country                         year student_ratio
##    <chr>                          <dbl>         <dbl>
##  1 Singapore                       2016          15.1
##  2 Russian Federation              2015          20.1
##  3 Small Island Developing States  2016          21.6
##  4 Small Island Developing States  2014          23.0
##  5 Bulgaria                        2016          11.8
##  6 Tonga                           2015          10.5
##  7 Honduras                        2015          29.1
##  8 South Africa                    2015          30.3
##  9 Asia (Central)                  2013          17.4
## 10 Solomon Islands                 2013          33.8
## # ℹ 40 more rows
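semi_join and anti_join are complements: between them they split the left table into rows with a match and rows without one. Here that is 0 + 50 = 50 rows. A sketch on hypothetical toy tables:

```r
library(dplyr)

# Hypothetical toy tables; values are invented for illustration.
ratio_toy <- tibble(
  country = c("A", "B", "C"),
  year = c(2015, 2016, 2017),
  student_ratio = c(10, 20, 30)
)
freedom_toy <- tibble(
  country = c("B", "C", "D"),
  year = c(2016, 2001, 2017),
  Region_Name = c("Europe", "Asia", "Africa")
)
keys <- c("country", "year")

# Rows of ratio_toy with a match in freedom_toy (just "B", 2016).
kept <- ratio_toy %>% semi_join(freedom_toy, by = keys)

# Rows of ratio_toy without a match ("A" and "C").
dropped <- ratio_toy %>% anti_join(freedom_toy, by = keys)

# Both are filtering joins: they return only ratio_toy's columns,
# and together they account for every row of ratio_toy.
nrow(kept) + nrow(dropped) == nrow(ratio_toy)
```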

Vincent Fosser

2025-10-22
