chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
## Rows: 2530 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): company_manufacturer, company_location, country_of_bean_origin, spe...
## dbl (3): ref, review_date, rating
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
coffee_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
## Rows: 1339 Columns: 43
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
## dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Both of these data sets are found under Tidy Tuesday and are factored around chocolate and coffee ratings. Some key factors in the date would be country and rating points.
Divide it using dplyr::select in a way the two have a common variable, which you could use to join the two.
set.seed(1234)
choc_small <- chocolate %>%
select(company_manufacturer, country_of_bean_origin, rating) %>%
sample_n(10)
choc_small
## # A tibble: 10 × 3
## company_manufacturer country_of_bean_origin rating
## <chr> <chr> <dbl>
## 1 Friis Holm Nicaragua 3
## 2 Coppeneur Trinidad 3
## 3 Franceschi Venezuela 3.5
## 4 Cacao Market Honduras 3.5
## 5 Soma Bolivia 3.5
## 6 Mission Puerto Rico 3
## 7 Upchurch Tanzania 3.25
## 8 Guittard Nicaragua 2.75
## 9 Bittersweet Origins Ghana 3.5
## 10 Sublime Origins Belize 3.25
set.seed(1234)
cof_small <- coffee_ratings %>%
select(country_of_origin, company, color) %>%
sample_n(10)
cof_small
## # A tibble: 10 × 3
## country_of_origin company color
## <chr> <chr> <chr>
## 1 Haiti haiti coffee Blue-Green
## 2 Myanmar sithar coffee co., ltd. <NA>
## 3 Mexico globus coffee Green
## 4 Guatemala unex guatemala, s.a. Green
## 5 Colombia exportadora de cafe condor s.a Bluish-Green
## 6 Mexico union ramal santa cruz spr de ri Green
## 7 Mexico sustainable harvest Green
## 8 United States (Hawaii) <NA> <NA>
## 9 Brazil ipanema coffees Green
## 10 Tanzania, United Republic Of acacia hills ltd Green
Use tidyr::left_join or other joining functions.
choc_small %>% left_join(cof_small, by = c("country_of_bean_origin" = "country_of_origin"))
## # A tibble: 10 × 5
## company_manufacturer country_of_bean_origin rating company color
## <chr> <chr> <dbl> <chr> <chr>
## 1 Friis Holm Nicaragua 3 <NA> <NA>
## 2 Coppeneur Trinidad 3 <NA> <NA>
## 3 Franceschi Venezuela 3.5 <NA> <NA>
## 4 Cacao Market Honduras 3.5 <NA> <NA>
## 5 Soma Bolivia 3.5 <NA> <NA>
## 6 Mission Puerto Rico 3 <NA> <NA>
## 7 Upchurch Tanzania 3.25 <NA> <NA>
## 8 Guittard Nicaragua 2.75 <NA> <NA>
## 9 Bittersweet Origins Ghana 3.5 <NA> <NA>
## 10 Sublime Origins Belize 3.25 <NA> <NA>
choc_small %>% right_join(cof_small, by = c("country_of_bean_origin" = "country_of_origin"))
## # A tibble: 10 × 5
## company_manufacturer country_of_bean_origin rating company color
## <chr> <chr> <dbl> <chr> <chr>
## 1 <NA> Haiti NA haiti coffee Blue…
## 2 <NA> Myanmar NA sithar coffee… <NA>
## 3 <NA> Mexico NA globus coffee Green
## 4 <NA> Guatemala NA unex guatemal… Green
## 5 <NA> Colombia NA exportadora d… Blui…
## 6 <NA> Mexico NA union ramal s… Green
## 7 <NA> Mexico NA sustainable h… Green
## 8 <NA> United States (Hawaii) NA <NA> <NA>
## 9 <NA> Brazil NA ipanema coffe… Green
## 10 <NA> Tanzania, United Republic Of NA acacia hills … Green
choc_small %>% full_join(cof_small, by = c("country_of_bean_origin" = "country_of_origin"))
## # A tibble: 20 × 5
## company_manufacturer country_of_bean_origin rating company color
## <chr> <chr> <dbl> <chr> <chr>
## 1 Friis Holm Nicaragua 3 <NA> <NA>
## 2 Coppeneur Trinidad 3 <NA> <NA>
## 3 Franceschi Venezuela 3.5 <NA> <NA>
## 4 Cacao Market Honduras 3.5 <NA> <NA>
## 5 Soma Bolivia 3.5 <NA> <NA>
## 6 Mission Puerto Rico 3 <NA> <NA>
## 7 Upchurch Tanzania 3.25 <NA> <NA>
## 8 Guittard Nicaragua 2.75 <NA> <NA>
## 9 Bittersweet Origins Ghana 3.5 <NA> <NA>
## 10 Sublime Origins Belize 3.25 <NA> <NA>
## 11 <NA> Haiti NA haiti coffee Blue…
## 12 <NA> Myanmar NA sithar coffee… <NA>
## 13 <NA> Mexico NA globus coffee Green
## 14 <NA> Guatemala NA unex guatemal… Green
## 15 <NA> Colombia NA exportadora d… Blui…
## 16 <NA> Mexico NA union ramal s… Green
## 17 <NA> Mexico NA sustainable h… Green
## 18 <NA> United States (Hawaii) NA <NA> <NA>
## 19 <NA> Brazil NA ipanema coffe… Green
## 20 <NA> Tanzania, United Republic Of NA acacia hills … Green
While both data sets are similar, they have many differences. Column names are different, and the data in them don’t exactly match as well as I would have thought. I started by making each data set smaller because it’s easier to work with. I picked column names that sounded similar so joining the data would be easier. When I joined the data with the left joining function, I used the chocolate rating as my base data set. The results were less interesting than I would have hoped because I didn’t have any overlapping data between the two. I didn’t find any critical or distinctive characteristics between the two data sets in any of the joining functions I tried.