Join and Tidy Data

In this exercise, we will cover joins and other ways to tidy-transform our data.

library(dplyr)
library(tidyr)

Joins

Let us begin with two data sets a and b, which have a unique identifier variable in common, name
We can merge the data sets by matching on values of name

data set `a`

a = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))
a

## # A tibble: 5 x 2
##   name         mixer       
##   <chr>        <chr>       
## 1 g&t          tonic       
## 2 mimosa       orange juice
## 3 rum and coke coke        
## 4 calimocho    coke        
## 5 polar        seltzer

data set `b`

b = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA","dark and stormy"), 
                   spirit=c("gin","sparkling wine", "rum", "red wine","beer","rum"))
b

## # A tibble: 6 x 2
##   name            spirit        
##   <chr>           <chr>         
## 1 g&t             gin           
## 2 mimosa          sparkling wine
## 3 rum and coke    rum           
## 4 calimocho       red wine      
## 5 IPA             beer          
## 6 dark and stormy rum

Let’s try out different joins:

`inner_join()`

Keep only the rows that have values of name in both data sets a and b

inner_join(a, b, by='name')

## # A tibble: 4 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine

`left_join()`

Keep all the rows in a
Find rows in data set b that have the same values of name as rows in data set a
Include the additional columns in b that don’t appear in a
Don’t include rows in b that have no matching value of name in b

left_join(a, b, by='name')

## # A tibble: 5 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine      
## 5 polar        seltzer      <NA>

`right_join()`

Keep all the rows in b
Find rows in data set a that have the same values of name as rows in data set b
Include the extra columns in a that don’t appear in b
Don’t include rows in a that have no matching value of name in b

right_join(a, b, by='name')

## # A tibble: 6 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 IPA             <NA>         beer          
## 6 dark and stormy <NA>         rum

`full_join()`

Keep all the rows in a and b, even if they don’t have matching values in name

full_join(a, b, by='name')

## # A tibble: 7 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 polar           seltzer      <NA>          
## 6 IPA             <NA>         beer          
## 7 dark and stormy <NA>         rum

We can also use joins for filtering.

`semi_join()`

For example, semi_join(a, b, by='name') keeps all the rows in a that have a matching value of name in b (it doesn’t create more columns with the information in b)
Note the difference between inner_join and semi_join (does not create more cols)

semi_join(a, b, by='name')

## # A tibble: 4 x 2
##   name         mixer       
##   <chr>        <chr>       
## 1 g&t          tonic       
## 2 mimosa       orange juice
## 3 rum and coke coke        
## 4 calimocho    coke

`anti_join()`

Keeps the rows in a that don’t have a matching value of name in b

anti_join(a, b, by='name')

## # A tibble: 1 x 2
##   name  mixer  
##   <chr> <chr>  
## 1 polar seltzer

This graphic summarizes the types of joins

More data tidying part I

Let us say we now add price information to the existing data sets:

a = a %>% mutate(price=c(8, 6, 7, 5, 5))
a

## # A tibble: 5 x 3
##   name         mixer        price
##   <chr>        <chr>        <dbl>
## 1 g&t          tonic            8
## 2 mimosa       orange juice     6
## 3 rum and coke coke             7
## 4 calimocho    coke             5
## 5 polar        seltzer          5

b = b  %>% mutate('$'=c(8, 6, 7, 5, 5, NA))
b

## # A tibble: 6 x 3
##   name            spirit           `$`
##   <chr>           <chr>          <dbl>
## 1 g&t             gin                8
## 2 mimosa          sparkling wine     6
## 3 rum and coke    rum                7
## 4 calimocho       red wine           5
## 5 IPA             beer               5
## 6 dark and stormy rum               NA

And then rejoin the new tibbles
First, rename the price variable in b

b = b %>% rename(price='$')

Then, redo the full_join, this time by name and price

full_join(a, b, by=c('name','price'))

## # A tibble: 7 x 4
##   name            mixer        price spirit        
##   <chr>           <chr>        <dbl> <chr>         
## 1 g&t             tonic            8 gin           
## 2 mimosa          orange juice     6 sparkling wine
## 3 rum and coke    coke             7 rum           
## 4 calimocho       coke             5 red wine      
## 5 polar           seltzer          5 <NA>          
## 6 IPA             <NA>             5 beer          
## 7 dark and stormy <NA>            NA rum

Note, if we don’t specify price in the by argument, we would get two price columns containing overlapping information

full_join(a, b, by=c('name'))

## # A tibble: 7 x 5
##   name            mixer        price.x spirit         price.y
##   <chr>           <chr>          <dbl> <chr>            <dbl>
## 1 g&t             tonic              8 gin                  8
## 2 mimosa          orange juice       6 sparkling wine       6
## 3 rum and coke    coke               7 rum                  7
## 4 calimocho       coke               5 red wine             5
## 5 polar           seltzer            5 <NA>                NA
## 6 IPA             <NA>              NA beer                 5
## 7 dark and stormy <NA>              NA rum                 NA

Reorder the columns so that spirit comes first, followed by mixer

join = full_join(a, b, by=c('name', 'price')) %>% 
  select(name, spirit, mixer, price)

Let us say we receive additional data that was missing for dark and stormy

ds = tibble(name='dark and stormy', spirit='ginger beer', price = 8)
ds

## # A tibble: 1 x 3
##   name            spirit      price
##   <chr>           <chr>       <dbl>
## 1 dark and stormy ginger beer     8

Incorporate this new information by full_joining
Note - spirit.y and price.y come from the ds tibble we introduced above

full_join(join, ds, by='name')

## # A tibble: 7 x 6
##   name            spirit.x       mixer        price.x spirit.y    price.y
##   <chr>           <chr>          <chr>          <dbl> <chr>         <dbl>
## 1 g&t             gin            tonic              8 <NA>             NA
## 2 mimosa          sparkling wine orange juice       6 <NA>             NA
## 3 rum and coke    rum            coke               7 <NA>             NA
## 4 calimocho       red wine       coke               5 <NA>             NA
## 5 polar           <NA>           seltzer            5 <NA>             NA
## 6 IPA             beer           <NA>               5 <NA>             NA
## 7 dark and stormy rum            <NA>              NA ginger beer       8

Combine price.x and price.y using the function coalesce(), and then remove price.x and price.y

join1 <- full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>% 
  select(-price.x, -price.y)
join1

## # A tibble: 7 x 5
##   name            spirit.x       mixer        spirit.y    price
##   <chr>           <chr>          <chr>        <chr>       <dbl>
## 1 g&t             gin            tonic        <NA>            8
## 2 mimosa          sparkling wine orange juice <NA>            6
## 3 rum and coke    rum            coke         <NA>            7
## 4 calimocho       red wine       coke         <NA>            5
## 5 polar           <NA>           seltzer      <NA>            5
## 6 IPA             beer           <NA>         <NA>            5
## 7 dark and stormy rum            <NA>         ginger beer     8

Now we turn our attention to combining spirit.x and spirit.y
First identify rows with non-missing values for spirit.x and spirit.y so we can decide how to handle them

join1 %>% filter(!(is.na(spirit.x)) & !(is.na(spirit.y)))

## # A tibble: 1 x 5
##   name            spirit.x mixer spirit.y    price
##   <chr>           <chr>    <chr> <chr>       <dbl>
## 1 dark and stormy rum      <NA>  ginger beer     8

There is one row containing non-missing values for spirit.x and spirit.y
Syntax is mutate(new_var = ifelse(condition, then do this, else do that))
Here, condition is that if both spirit.x and spirit.y have non-missing values, then the values in those two columns are concatenated using paste0(), and if not, the values are retained from spirit.x as is

join2 <- join1 %>% 
  mutate(spirit = ifelse(!(is.na(spirit.x)) & !(is.na(spirit.y)), paste0(spirit.x, ", ",spirit.y), spirit.x)) %>%
  replace(is.na(.),"") %>%
  select(-spirit.x, -spirit.y)
join2

## # A tibble: 7 x 4
##   name            mixer          price spirit            
##   <chr>           <chr>          <dbl> <chr>             
## 1 g&t             "tonic"            8 "gin"             
## 2 mimosa          "orange juice"     6 "sparkling wine"  
## 3 rum and coke    "coke"             7 "rum"             
## 4 calimocho       "coke"             5 "red wine"        
## 5 polar           "seltzer"          5 ""                
## 6 IPA             ""                 5 "beer"            
## 7 dark and stormy ""                 8 "rum, ginger beer"

More data tidying part II

Let us say we wish to calculate the total price for a mixed drink which is is equal to the sum of the prices of the components, the mixer and spirit
In this portion we wish to retain the prices from both data sets - a contains mixer prices and b spirit prices

a = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))  %>% 
                      mutate(price=c(8, 6, 7, 5, 5))
a

## # A tibble: 5 x 3
##   name         mixer        price
##   <chr>        <chr>        <dbl>
## 1 g&t          tonic            8
## 2 mimosa       orange juice     6
## 3 rum and coke coke             7
## 4 calimocho    coke             5
## 5 polar        seltzer          5

b = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA"), 
                    spirit=c("gin","sparkling wine", "rum", "red wine","beer")) %>% 
                    mutate(price=c(8, 6, 7, 5, 5))
b

## # A tibble: 5 x 3
##   name         spirit         price
##   <chr>        <chr>          <dbl>
## 1 g&t          gin                8
## 2 mimosa       sparkling wine     6
## 3 rum and coke rum                7
## 4 calimocho    red wine           5
## 5 IPA          beer               5

How can we merge the data and combine (add) prices?
We can start by full_joining by name

full_join(a, b, by='name')

## # A tibble: 6 x 5
##   name         mixer        price.x spirit         price.y
##   <chr>        <chr>          <dbl> <chr>            <dbl>
## 1 g&t          tonic              8 gin                  8
## 2 mimosa       orange juice       6 sparkling wine       6
## 3 rum and coke coke               7 rum                  7
## 4 calimocho    coke               5 red wine             5
## 5 polar        seltzer            5 <NA>                NA
## 6 IPA          <NA>              NA beer                 5

Unfortunately, a simple mutate instruction isn’t satisfactory because the sum of an NA + a number is equal to NA:

full_join(a, b, by='name') %>% mutate(total=price.x+price.y)

## # A tibble: 6 x 6
##   name         mixer        price.x spirit         price.y total
##   <chr>        <chr>          <dbl> <chr>            <dbl> <dbl>
## 1 g&t          tonic              8 gin                  8    16
## 2 mimosa       orange juice       6 sparkling wine       6    12
## 3 rum and coke coke               7 rum                  7    14
## 4 calimocho    coke               5 red wine             5    10
## 5 polar        seltzer            5 <NA>                NA    NA
## 6 IPA          <NA>              NA beer                 5    NA

We can solve this problem in different ways.
One way is using coalesce to replace the NAs by zeros in price.x and price.y, and then adding:

full_join(a, b, by='name') %>% 
  mutate(price.x=coalesce(price.x,0), 
         price.y=coalesce(price.y,0), 
         total=price.x+price.y) %>% 
  select(name, spirit, mixer, total)

## # A tibble: 6 x 4
##   name         spirit         mixer        total
##   <chr>        <chr>          <chr>        <dbl>
## 1 g&t          gin            tonic           16
## 2 mimosa       sparkling wine orange juice    12
## 3 rum and coke rum            coke            14
## 4 calimocho    red wine       coke            10
## 5 polar        <NA>           seltzer          5
## 6 IPA          beer           <NA>             5

Another way

join3 <- full_join(a, b, by='name') %>% 
  mutate(price.x = replace_na(price.x, 0),
         price.y = replace_na(price.y, 0),
         total=price.x+price.y) %>% 
  select(name, spirit, mixer, total)
join3

## # A tibble: 6 x 4
##   name         spirit         mixer        total
##   <chr>        <chr>          <chr>        <dbl>
## 1 g&t          gin            tonic           16
## 2 mimosa       sparkling wine orange juice    12
## 3 rum and coke rum            coke            14
## 4 calimocho    red wine       coke            10
## 5 polar        <NA>           seltzer          5
## 6 IPA          beer           <NA>             5

In general, the way that R deals with NAs can be a little frustrating, as these examples show.
We can use unite() to create a new column that merges the information in spirit and mixer

join3 %>% unite(drink, spirit, mixer, sep=' & ')  #`drink` is the new variable to be created from `spirit` and `mixer`

## # A tibble: 6 x 3
##   name         drink                         total
##   <chr>        <chr>                         <dbl>
## 1 g&t          gin & tonic                      16
## 2 mimosa       sparkling wine & orange juice    12
## 3 rum and coke rum & coke                       14
## 4 calimocho    red wine & coke                  10
## 5 polar        NA & seltzer                      5
## 6 IPA          beer & NA                         5

Note that the NA’s are retained which is not desirable
We need to substitute the NA’s with “nothing” using coalesce() before unite()
trimws() gets rid of any empty spaces at the start and end of the drink variable

join4 = join3 %>% 
  mutate(spirit=coalesce(spirit, "nothing"), 
         mixer=coalesce(mixer, "nothing")) %>%
  unite(drink, spirit, mixer, sep=' & ') %>% 
  mutate(drink=trimws(drink)) 
join4

## # A tibble: 6 x 3
##   name         drink                         total
##   <chr>        <chr>                         <dbl>
## 1 g&t          gin & tonic                      16
## 2 mimosa       sparkling wine & orange juice    12
## 3 rum and coke rum & coke                       14
## 4 calimocho    red wine & coke                  10
## 5 polar        nothing & seltzer                 5
## 6 IPA          beer & nothing                    5

We can use gsub() to substitute "& nothing" and "nothing &" with blank or empty space
Note - Be careful not to leave a space between the |
The following shows how we tidy-transformed drink into name

join5 = join4 %>% 
  mutate(drink1=gsub("& nothing|nothing &","",drink)) %>%
  mutate(name=trimws(drink1)) %>%
  select(drink, drink1, name, total)
join5

## # A tibble: 6 x 4
##   drink                   drink1                    name                   total
##   <chr>                   <chr>                     <chr>                  <dbl>
## 1 gin & tonic             "gin & tonic"             gin & tonic               16
## 2 sparkling wine & orang~ "sparkling wine & orange~ sparkling wine & oran~    12
## 3 rum & coke              "rum & coke"              rum & coke                14
## 4 red wine & coke         "red wine & coke"         red wine & coke           10
## 5 nothing & seltzer       " seltzer"                seltzer                    5
## 6 beer & nothing          "beer "                   beer                       5

Join and Tidy Data

Gina Moreno

12/10/2018

Joins

data set `a`

data set `b`

`inner_join()`

`left_join()`

`right_join()`

`full_join()`

`semi_join()`

`anti_join()`

More data tidying part I

More data tidying part II

Join and Tidy Data

Gina Moreno

12/10/2018

Joins

data set a

data set b

inner_join()

left_join()

right_join()

full_join()

semi_join()

anti_join()

More data tidying part I

More data tidying part II

data set `a`

data set `b`

`inner_join()`

`left_join()`

`right_join()`

`full_join()`

`semi_join()`

`anti_join()`