Module 9: Apply it to your data 8

Import your data

chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

## Rows: 2530 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): company_manufacturer, company_location, country_of_bean_origin, spe...
## dbl (3): ref, review_date, rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

coffee_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')

## Rows: 1339 Columns: 43
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
## dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Chapter 13

What are primary keys in your data?

Both of these data sets are found under Tidy Tuesday and are factored around chocolate and coffee ratings. Some key factors in the date would be country and rating points.

Can you divide your data into two?

Divide it using dplyr::select in a way the two have a common variable, which you could use to join the two.

set.seed(1234)
choc_small <- chocolate %>%
    select(company_manufacturer, country_of_bean_origin, rating) %>%
    sample_n(10)

choc_small

## # A tibble: 10 × 3
##    company_manufacturer country_of_bean_origin rating
##    <chr>                <chr>                   <dbl>
##  1 Friis Holm           Nicaragua                3   
##  2 Coppeneur            Trinidad                 3   
##  3 Franceschi           Venezuela                3.5 
##  4 Cacao Market         Honduras                 3.5 
##  5 Soma                 Bolivia                  3.5 
##  6 Mission              Puerto Rico              3   
##  7 Upchurch             Tanzania                 3.25
##  8 Guittard             Nicaragua                2.75
##  9 Bittersweet Origins  Ghana                    3.5 
## 10 Sublime Origins      Belize                   3.25

set.seed(1234)
cof_small <- coffee_ratings %>%
    select(country_of_origin, company, color) %>%
    sample_n(10)

cof_small

## # A tibble: 10 × 3
##    country_of_origin            company                          color       
##    <chr>                        <chr>                            <chr>       
##  1 Haiti                        haiti coffee                     Blue-Green  
##  2 Myanmar                      sithar coffee co., ltd.          <NA>        
##  3 Mexico                       globus coffee                    Green       
##  4 Guatemala                    unex guatemala, s.a.             Green       
##  5 Colombia                     exportadora de cafe condor s.a   Bluish-Green
##  6 Mexico                       union ramal santa cruz spr de ri Green       
##  7 Mexico                       sustainable harvest              Green       
##  8 United States (Hawaii)       <NA>                             <NA>        
##  9 Brazil                       ipanema coffees                  Green       
## 10 Tanzania, United Republic Of acacia hills ltd                 Green

Can you join the two together?

Use tidyr::left_join or other joining functions.

Left Join

choc_small %>% left_join(cof_small, by = c("country_of_bean_origin" = "country_of_origin"))

## # A tibble: 10 × 5
##    company_manufacturer country_of_bean_origin rating company color
##    <chr>                <chr>                   <dbl> <chr>   <chr>
##  1 Friis Holm           Nicaragua                3    <NA>    <NA> 
##  2 Coppeneur            Trinidad                 3    <NA>    <NA> 
##  3 Franceschi           Venezuela                3.5  <NA>    <NA> 
##  4 Cacao Market         Honduras                 3.5  <NA>    <NA> 
##  5 Soma                 Bolivia                  3.5  <NA>    <NA> 
##  6 Mission              Puerto Rico              3    <NA>    <NA> 
##  7 Upchurch             Tanzania                 3.25 <NA>    <NA> 
##  8 Guittard             Nicaragua                2.75 <NA>    <NA> 
##  9 Bittersweet Origins  Ghana                    3.5  <NA>    <NA> 
## 10 Sublime Origins      Belize                   3.25 <NA>    <NA>

Right Join

choc_small %>% right_join(cof_small, by = c("country_of_bean_origin" = "country_of_origin"))

## # A tibble: 10 × 5
##    company_manufacturer country_of_bean_origin       rating company        color
##    <chr>                <chr>                         <dbl> <chr>          <chr>
##  1 <NA>                 Haiti                            NA haiti coffee   Blue…
##  2 <NA>                 Myanmar                          NA sithar coffee… <NA> 
##  3 <NA>                 Mexico                           NA globus coffee  Green
##  4 <NA>                 Guatemala                        NA unex guatemal… Green
##  5 <NA>                 Colombia                         NA exportadora d… Blui…
##  6 <NA>                 Mexico                           NA union ramal s… Green
##  7 <NA>                 Mexico                           NA sustainable h… Green
##  8 <NA>                 United States (Hawaii)           NA <NA>           <NA> 
##  9 <NA>                 Brazil                           NA ipanema coffe… Green
## 10 <NA>                 Tanzania, United Republic Of     NA acacia hills … Green

Full Join

choc_small %>% full_join(cof_small, by = c("country_of_bean_origin" = "country_of_origin"))

## # A tibble: 20 × 5
##    company_manufacturer country_of_bean_origin       rating company        color
##    <chr>                <chr>                         <dbl> <chr>          <chr>
##  1 Friis Holm           Nicaragua                      3    <NA>           <NA> 
##  2 Coppeneur            Trinidad                       3    <NA>           <NA> 
##  3 Franceschi           Venezuela                      3.5  <NA>           <NA> 
##  4 Cacao Market         Honduras                       3.5  <NA>           <NA> 
##  5 Soma                 Bolivia                        3.5  <NA>           <NA> 
##  6 Mission              Puerto Rico                    3    <NA>           <NA> 
##  7 Upchurch             Tanzania                       3.25 <NA>           <NA> 
##  8 Guittard             Nicaragua                      2.75 <NA>           <NA> 
##  9 Bittersweet Origins  Ghana                          3.5  <NA>           <NA> 
## 10 Sublime Origins      Belize                         3.25 <NA>           <NA> 
## 11 <NA>                 Haiti                         NA    haiti coffee   Blue…
## 12 <NA>                 Myanmar                       NA    sithar coffee… <NA> 
## 13 <NA>                 Mexico                        NA    globus coffee  Green
## 14 <NA>                 Guatemala                     NA    unex guatemal… Green
## 15 <NA>                 Colombia                      NA    exportadora d… Blui…
## 16 <NA>                 Mexico                        NA    union ramal s… Green
## 17 <NA>                 Mexico                        NA    sustainable h… Green
## 18 <NA>                 United States (Hawaii)        NA    <NA>           <NA> 
## 19 <NA>                 Brazil                        NA    ipanema coffe… Green
## 20 <NA>                 Tanzania, United Republic Of  NA    acacia hills … Green

Conclusion

While both data sets are similar, they have many differences. Column names are different, and the data in them don’t exactly match as well as I would have thought. I started by making each data set smaller because it’s easier to work with. I picked column names that sounded similar so joining the data would be easier. When I joined the data with the left joining function, I used the chocolate rating as my base data set. The results were less interesting than I would have hoped because I didn’t have any overlapping data between the two. I didn’t find any critical or distinctive characteristics between the two data sets in any of the joining functions I tried.