In this exercise, we will cover joins and other ways to tidy-transform our data.

library(dplyr)
library(tidyr)

Joins

  • Let us begin with two data sets a and b, which have a unique identifier variable in common, name
  • We can merge the data sets by matching on values of name

data set a

a = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))
a 
## # A tibble: 5 x 2
##   name         mixer       
##   <chr>        <chr>       
## 1 g&t          tonic       
## 2 mimosa       orange juice
## 3 rum and coke coke        
## 4 calimocho    coke        
## 5 polar        seltzer

data set b

b = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA","dark and stormy"), 
                   spirit=c("gin","sparkling wine", "rum", "red wine","beer","rum"))
b
## # A tibble: 6 x 2
##   name            spirit        
##   <chr>           <chr>         
## 1 g&t             gin           
## 2 mimosa          sparkling wine
## 3 rum and coke    rum           
## 4 calimocho       red wine      
## 5 IPA             beer          
## 6 dark and stormy rum

Let’s try out different joins:

inner_join()

  • Keep only the rows that have values of name in both data sets a and b
inner_join(a, b, by='name')
## # A tibble: 4 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine

left_join()

  • Keep all the rows in a
  • Find rows in data set b that have the same values of name as rows in data set a
  • Include the additional columns in b that don’t appear in a
  • Don’t include rows in b that have no matching value of name in b
left_join(a, b, by='name') 
## # A tibble: 5 x 3
##   name         mixer        spirit        
##   <chr>        <chr>        <chr>         
## 1 g&t          tonic        gin           
## 2 mimosa       orange juice sparkling wine
## 3 rum and coke coke         rum           
## 4 calimocho    coke         red wine      
## 5 polar        seltzer      <NA>

right_join()

  • Keep all the rows in b
  • Find rows in data set a that have the same values of name as rows in data set b
  • Include the extra columns in a that don’t appear in b
  • Don’t include rows in a that have no matching value of name in b
right_join(a, b, by='name')  
## # A tibble: 6 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 IPA             <NA>         beer          
## 6 dark and stormy <NA>         rum

full_join()

  • Keep all the rows in a and b, even if they don’t have matching values in name
full_join(a, b, by='name')
## # A tibble: 7 x 3
##   name            mixer        spirit        
##   <chr>           <chr>        <chr>         
## 1 g&t             tonic        gin           
## 2 mimosa          orange juice sparkling wine
## 3 rum and coke    coke         rum           
## 4 calimocho       coke         red wine      
## 5 polar           seltzer      <NA>          
## 6 IPA             <NA>         beer          
## 7 dark and stormy <NA>         rum

We can also use joins for filtering.

semi_join()

  • For example, semi_join(a, b, by='name') keeps all the rows in a that have a matching value of name in b (it doesn’t create more columns with the information in b)
  • Note the difference between inner_join and semi_join (does not create more cols)
semi_join(a, b, by='name')
## # A tibble: 4 x 2
##   name         mixer       
##   <chr>        <chr>       
## 1 g&t          tonic       
## 2 mimosa       orange juice
## 3 rum and coke coke        
## 4 calimocho    coke

anti_join()

  • Keeps the rows in a that don’t have a matching value of name in b
anti_join(a, b, by='name')
## # A tibble: 1 x 2
##   name  mixer  
##   <chr> <chr>  
## 1 polar seltzer

This graphic summarizes the types of joins

More data tidying part I

  • Let us say we now add price information to the existing data sets:
a = a %>% mutate(price=c(8, 6, 7, 5, 5))
a
## # A tibble: 5 x 3
##   name         mixer        price
##   <chr>        <chr>        <dbl>
## 1 g&t          tonic            8
## 2 mimosa       orange juice     6
## 3 rum and coke coke             7
## 4 calimocho    coke             5
## 5 polar        seltzer          5
b = b  %>% mutate('$'=c(8, 6, 7, 5, 5, NA))
b
## # A tibble: 6 x 3
##   name            spirit           `$`
##   <chr>           <chr>          <dbl>
## 1 g&t             gin                8
## 2 mimosa          sparkling wine     6
## 3 rum and coke    rum                7
## 4 calimocho       red wine           5
## 5 IPA             beer               5
## 6 dark and stormy rum               NA
  • And then rejoin the new tibbles

  • First, rename the price variable in b

b = b %>% rename(price='$')
  • Then, redo the full_join, this time by name and price
full_join(a, b, by=c('name','price'))
## # A tibble: 7 x 4
##   name            mixer        price spirit        
##   <chr>           <chr>        <dbl> <chr>         
## 1 g&t             tonic            8 gin           
## 2 mimosa          orange juice     6 sparkling wine
## 3 rum and coke    coke             7 rum           
## 4 calimocho       coke             5 red wine      
## 5 polar           seltzer          5 <NA>          
## 6 IPA             <NA>             5 beer          
## 7 dark and stormy <NA>            NA rum
  • Note, if we don’t specify price in the by argument, we would get two price columns containing overlapping information
full_join(a, b, by=c('name'))
## # A tibble: 7 x 5
##   name            mixer        price.x spirit         price.y
##   <chr>           <chr>          <dbl> <chr>            <dbl>
## 1 g&t             tonic              8 gin                  8
## 2 mimosa          orange juice       6 sparkling wine       6
## 3 rum and coke    coke               7 rum                  7
## 4 calimocho       coke               5 red wine             5
## 5 polar           seltzer            5 <NA>                NA
## 6 IPA             <NA>              NA beer                 5
## 7 dark and stormy <NA>              NA rum                 NA
  • Reorder the columns so that spirit comes first, followed by mixer
join = full_join(a, b, by=c('name', 'price')) %>% 
  select(name, spirit, mixer, price)
  • Let us say we receive additional data that was missing for dark and stormy
ds = tibble(name='dark and stormy', spirit='ginger beer', price = 8)
ds 
## # A tibble: 1 x 3
##   name            spirit      price
##   <chr>           <chr>       <dbl>
## 1 dark and stormy ginger beer     8
  • Incorporate this new information by full_joining

  • Note - spirit.y and price.y come from the ds tibble we introduced above

full_join(join, ds, by='name')  
## # A tibble: 7 x 6
##   name            spirit.x       mixer        price.x spirit.y    price.y
##   <chr>           <chr>          <chr>          <dbl> <chr>         <dbl>
## 1 g&t             gin            tonic              8 <NA>             NA
## 2 mimosa          sparkling wine orange juice       6 <NA>             NA
## 3 rum and coke    rum            coke               7 <NA>             NA
## 4 calimocho       red wine       coke               5 <NA>             NA
## 5 polar           <NA>           seltzer            5 <NA>             NA
## 6 IPA             beer           <NA>               5 <NA>             NA
## 7 dark and stormy rum            <NA>              NA ginger beer       8
  • Combine price.x and price.y using the function coalesce(), and then remove price.x and price.y
join1 <- full_join(join, ds, by='name') %>%  
  mutate(price=coalesce(price.x,price.y)) %>% 
  select(-price.x, -price.y)
join1
## # A tibble: 7 x 5
##   name            spirit.x       mixer        spirit.y    price
##   <chr>           <chr>          <chr>        <chr>       <dbl>
## 1 g&t             gin            tonic        <NA>            8
## 2 mimosa          sparkling wine orange juice <NA>            6
## 3 rum and coke    rum            coke         <NA>            7
## 4 calimocho       red wine       coke         <NA>            5
## 5 polar           <NA>           seltzer      <NA>            5
## 6 IPA             beer           <NA>         <NA>            5
## 7 dark and stormy rum            <NA>         ginger beer     8
  • Now we turn our attention to combining spirit.x and spirit.y
  • First identify rows with non-missing values for spirit.x and spirit.y so we can decide how to handle them
join1 %>% filter(!(is.na(spirit.x)) & !(is.na(spirit.y)))
## # A tibble: 1 x 5
##   name            spirit.x mixer spirit.y    price
##   <chr>           <chr>    <chr> <chr>       <dbl>
## 1 dark and stormy rum      <NA>  ginger beer     8
  • There is one row containing non-missing values for spirit.x and spirit.y
  • Syntax is mutate(new_var = ifelse(condition, then do this, else do that))
  • Here, condition is that if both spirit.x and spirit.y have non-missing values, then the values in those two columns are concatenated using paste0(), and if not, the values are retained from spirit.x as is
join2 <- join1 %>% 
  mutate(spirit = ifelse(!(is.na(spirit.x)) & !(is.na(spirit.y)), paste0(spirit.x, ", ",spirit.y), spirit.x)) %>%
  replace(is.na(.),"") %>%
  select(-spirit.x, -spirit.y)
join2
## # A tibble: 7 x 4
##   name            mixer          price spirit            
##   <chr>           <chr>          <dbl> <chr>             
## 1 g&t             "tonic"            8 "gin"             
## 2 mimosa          "orange juice"     6 "sparkling wine"  
## 3 rum and coke    "coke"             7 "rum"             
## 4 calimocho       "coke"             5 "red wine"        
## 5 polar           "seltzer"          5 ""                
## 6 IPA             ""                 5 "beer"            
## 7 dark and stormy ""                 8 "rum, ginger beer"

More data tidying part II

  • Let us say we wish to calculate the total price for a mixed drink which is is equal to the sum of the prices of the components, the mixer and spirit
  • In this portion we wish to retain the prices from both data sets - a contains mixer prices and b spirit prices
a = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"), 
                      mixer=c("tonic","orange juice", "coke", "coke","seltzer"))  %>% 
                      mutate(price=c(8, 6, 7, 5, 5))
a
## # A tibble: 5 x 3
##   name         mixer        price
##   <chr>        <chr>        <dbl>
## 1 g&t          tonic            8
## 2 mimosa       orange juice     6
## 3 rum and coke coke             7
## 4 calimocho    coke             5
## 5 polar        seltzer          5
b = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA"), 
                    spirit=c("gin","sparkling wine", "rum", "red wine","beer")) %>% 
                    mutate(price=c(8, 6, 7, 5, 5))
b
## # A tibble: 5 x 3
##   name         spirit         price
##   <chr>        <chr>          <dbl>
## 1 g&t          gin                8
## 2 mimosa       sparkling wine     6
## 3 rum and coke rum                7
## 4 calimocho    red wine           5
## 5 IPA          beer               5
  • How can we merge the data and combine (add) prices?
  • We can start by full_joining by name
full_join(a, b, by='name')
## # A tibble: 6 x 5
##   name         mixer        price.x spirit         price.y
##   <chr>        <chr>          <dbl> <chr>            <dbl>
## 1 g&t          tonic              8 gin                  8
## 2 mimosa       orange juice       6 sparkling wine       6
## 3 rum and coke coke               7 rum                  7
## 4 calimocho    coke               5 red wine             5
## 5 polar        seltzer            5 <NA>                NA
## 6 IPA          <NA>              NA beer                 5
  • Unfortunately, a simple mutate instruction isn’t satisfactory because the sum of an NA + a number is equal to NA:
full_join(a, b, by='name') %>% mutate(total=price.x+price.y)
## # A tibble: 6 x 6
##   name         mixer        price.x spirit         price.y total
##   <chr>        <chr>          <dbl> <chr>            <dbl> <dbl>
## 1 g&t          tonic              8 gin                  8    16
## 2 mimosa       orange juice       6 sparkling wine       6    12
## 3 rum and coke coke               7 rum                  7    14
## 4 calimocho    coke               5 red wine             5    10
## 5 polar        seltzer            5 <NA>                NA    NA
## 6 IPA          <NA>              NA beer                 5    NA
  • We can solve this problem in different ways.
  • One way is using coalesce to replace the NAs by zeros in price.x and price.y, and then adding:
full_join(a, b, by='name') %>% 
  mutate(price.x=coalesce(price.x,0), 
         price.y=coalesce(price.y,0), 
         total=price.x+price.y) %>% 
  select(name, spirit, mixer, total)
## # A tibble: 6 x 4
##   name         spirit         mixer        total
##   <chr>        <chr>          <chr>        <dbl>
## 1 g&t          gin            tonic           16
## 2 mimosa       sparkling wine orange juice    12
## 3 rum and coke rum            coke            14
## 4 calimocho    red wine       coke            10
## 5 polar        <NA>           seltzer          5
## 6 IPA          beer           <NA>             5
  • Another way
join3 <- full_join(a, b, by='name') %>% 
  mutate(price.x = replace_na(price.x, 0),
         price.y = replace_na(price.y, 0),
         total=price.x+price.y) %>% 
  select(name, spirit, mixer, total)
join3
## # A tibble: 6 x 4
##   name         spirit         mixer        total
##   <chr>        <chr>          <chr>        <dbl>
## 1 g&t          gin            tonic           16
## 2 mimosa       sparkling wine orange juice    12
## 3 rum and coke rum            coke            14
## 4 calimocho    red wine       coke            10
## 5 polar        <NA>           seltzer          5
## 6 IPA          beer           <NA>             5
  • In general, the way that R deals with NAs can be a little frustrating, as these examples show.

  • We can use unite() to create a new column that merges the information in spirit and mixer

join3 %>% unite(drink, spirit, mixer, sep=' & ')  #`drink` is the new variable to be created from `spirit` and `mixer`
## # A tibble: 6 x 3
##   name         drink                         total
##   <chr>        <chr>                         <dbl>
## 1 g&t          gin & tonic                      16
## 2 mimosa       sparkling wine & orange juice    12
## 3 rum and coke rum & coke                       14
## 4 calimocho    red wine & coke                  10
## 5 polar        NA & seltzer                      5
## 6 IPA          beer & NA                         5
  • Note that the NA’s are retained which is not desirable
  • We need to substitute the NA’s with “nothing” using coalesce() before unite()
  • trimws() gets rid of any empty spaces at the start and end of the drink variable
join4 = join3 %>% 
  mutate(spirit=coalesce(spirit, "nothing"), 
         mixer=coalesce(mixer, "nothing")) %>%
  unite(drink, spirit, mixer, sep=' & ') %>% 
  mutate(drink=trimws(drink)) 
join4
## # A tibble: 6 x 3
##   name         drink                         total
##   <chr>        <chr>                         <dbl>
## 1 g&t          gin & tonic                      16
## 2 mimosa       sparkling wine & orange juice    12
## 3 rum and coke rum & coke                       14
## 4 calimocho    red wine & coke                  10
## 5 polar        nothing & seltzer                 5
## 6 IPA          beer & nothing                    5
  • We can use gsub() to substitute "& nothing" and "nothing &" with blank or empty space
  • Note - Be careful not to leave a space between the |
  • The following shows how we tidy-transformed drink into name
join5 = join4 %>% 
  mutate(drink1=gsub("& nothing|nothing &","",drink)) %>%
  mutate(name=trimws(drink1)) %>%
  select(drink, drink1, name, total)
join5
## # A tibble: 6 x 4
##   drink                   drink1                    name                   total
##   <chr>                   <chr>                     <chr>                  <dbl>
## 1 gin & tonic             "gin & tonic"             gin & tonic               16
## 2 sparkling wine & orang~ "sparkling wine & orange~ sparkling wine & oran~    12
## 3 rum & coke              "rum & coke"              rum & coke                14
## 4 red wine & coke         "red wine & coke"         red wine & coke           10
## 5 nothing & seltzer       " seltzer"                seltzer                    5
## 6 beer & nothing          "beer "                   beer                       5