In this exercise, we will cover joins and other ways to tidy-transform our data.
library(dplyr)
library(tidyr)
a and b, which have a unique identifier variable in common, namenameaa = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"),
mixer=c("tonic","orange juice", "coke", "coke","seltzer"))
a
## # A tibble: 5 x 2
## name mixer
## <chr> <chr>
## 1 g&t tonic
## 2 mimosa orange juice
## 3 rum and coke coke
## 4 calimocho coke
## 5 polar seltzer
bb = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA","dark and stormy"),
spirit=c("gin","sparkling wine", "rum", "red wine","beer","rum"))
b
## # A tibble: 6 x 2
## name spirit
## <chr> <chr>
## 1 g&t gin
## 2 mimosa sparkling wine
## 3 rum and coke rum
## 4 calimocho red wine
## 5 IPA beer
## 6 dark and stormy rum
Let’s try out different joins:
inner_join()name in both data sets a and binner_join(a, b, by='name')
## # A tibble: 4 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
left_join()ab that have the same values of name as rows in data set ab that don’t appear in ab that have no matching value of name in bleft_join(a, b, by='name')
## # A tibble: 5 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
## 5 polar seltzer <NA>
right_join()ba that have the same values of name as rows in data set ba that don’t appear in ba that have no matching value of name in bright_join(a, b, by='name')
## # A tibble: 6 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
## 5 IPA <NA> beer
## 6 dark and stormy <NA> rum
full_join()a and b, even if they don’t have matching values in namefull_join(a, b, by='name')
## # A tibble: 7 x 3
## name mixer spirit
## <chr> <chr> <chr>
## 1 g&t tonic gin
## 2 mimosa orange juice sparkling wine
## 3 rum and coke coke rum
## 4 calimocho coke red wine
## 5 polar seltzer <NA>
## 6 IPA <NA> beer
## 7 dark and stormy <NA> rum
We can also use joins for filtering.
semi_join()semi_join(a, b, by='name') keeps all the rows in a that have a matching value of name in b (it doesn’t create more columns with the information in b)semi_join(a, b, by='name')
## # A tibble: 4 x 2
## name mixer
## <chr> <chr>
## 1 g&t tonic
## 2 mimosa orange juice
## 3 rum and coke coke
## 4 calimocho coke
anti_join()a that don’t have a matching value of name in banti_join(a, b, by='name')
## # A tibble: 1 x 2
## name mixer
## <chr> <chr>
## 1 polar seltzer
This graphic summarizes the types of joins
a = a %>% mutate(price=c(8, 6, 7, 5, 5))
a
## # A tibble: 5 x 3
## name mixer price
## <chr> <chr> <dbl>
## 1 g&t tonic 8
## 2 mimosa orange juice 6
## 3 rum and coke coke 7
## 4 calimocho coke 5
## 5 polar seltzer 5
b = b %>% mutate('$'=c(8, 6, 7, 5, 5, NA))
b
## # A tibble: 6 x 3
## name spirit `$`
## <chr> <chr> <dbl>
## 1 g&t gin 8
## 2 mimosa sparkling wine 6
## 3 rum and coke rum 7
## 4 calimocho red wine 5
## 5 IPA beer 5
## 6 dark and stormy rum NA
And then rejoin the new tibbles
First, rename the price variable in b
b = b %>% rename(price='$')
full_join, this time by name and pricefull_join(a, b, by=c('name','price'))
## # A tibble: 7 x 4
## name mixer price spirit
## <chr> <chr> <dbl> <chr>
## 1 g&t tonic 8 gin
## 2 mimosa orange juice 6 sparkling wine
## 3 rum and coke coke 7 rum
## 4 calimocho coke 5 red wine
## 5 polar seltzer 5 <NA>
## 6 IPA <NA> 5 beer
## 7 dark and stormy <NA> NA rum
price in the by argument, we would get two price columns containing overlapping informationfull_join(a, b, by=c('name'))
## # A tibble: 7 x 5
## name mixer price.x spirit price.y
## <chr> <chr> <dbl> <chr> <dbl>
## 1 g&t tonic 8 gin 8
## 2 mimosa orange juice 6 sparkling wine 6
## 3 rum and coke coke 7 rum 7
## 4 calimocho coke 5 red wine 5
## 5 polar seltzer 5 <NA> NA
## 6 IPA <NA> NA beer 5
## 7 dark and stormy <NA> NA rum NA
spirit comes first, followed by mixerjoin = full_join(a, b, by=c('name', 'price')) %>%
select(name, spirit, mixer, price)
dark and stormyds = tibble(name='dark and stormy', spirit='ginger beer', price = 8)
ds
## # A tibble: 1 x 3
## name spirit price
## <chr> <chr> <dbl>
## 1 dark and stormy ginger beer 8
Incorporate this new information by full_joining
Note - spirit.y and price.y come from the ds tibble we introduced above
full_join(join, ds, by='name')
## # A tibble: 7 x 6
## name spirit.x mixer price.x spirit.y price.y
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 g&t gin tonic 8 <NA> NA
## 2 mimosa sparkling wine orange juice 6 <NA> NA
## 3 rum and coke rum coke 7 <NA> NA
## 4 calimocho red wine coke 5 <NA> NA
## 5 polar <NA> seltzer 5 <NA> NA
## 6 IPA beer <NA> 5 <NA> NA
## 7 dark and stormy rum <NA> NA ginger beer 8
price.x and price.y using the function coalesce(), and then remove price.x and price.yjoin1 <- full_join(join, ds, by='name') %>%
mutate(price=coalesce(price.x,price.y)) %>%
select(-price.x, -price.y)
join1
## # A tibble: 7 x 5
## name spirit.x mixer spirit.y price
## <chr> <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic <NA> 8
## 2 mimosa sparkling wine orange juice <NA> 6
## 3 rum and coke rum coke <NA> 7
## 4 calimocho red wine coke <NA> 5
## 5 polar <NA> seltzer <NA> 5
## 6 IPA beer <NA> <NA> 5
## 7 dark and stormy rum <NA> ginger beer 8
spirit.x and spirit.yspirit.x and spirit.y so we can decide how to handle themjoin1 %>% filter(!(is.na(spirit.x)) & !(is.na(spirit.y)))
## # A tibble: 1 x 5
## name spirit.x mixer spirit.y price
## <chr> <chr> <chr> <chr> <dbl>
## 1 dark and stormy rum <NA> ginger beer 8
spirit.x and spirit.ymutate(new_var = ifelse(condition, then do this, else do that))spirit.x and spirit.y have non-missing values, then the values in those two columns are concatenated using paste0(), and if not, the values are retained from spirit.x as isjoin2 <- join1 %>%
mutate(spirit = ifelse(!(is.na(spirit.x)) & !(is.na(spirit.y)), paste0(spirit.x, ", ",spirit.y), spirit.x)) %>%
replace(is.na(.),"") %>%
select(-spirit.x, -spirit.y)
join2
## # A tibble: 7 x 4
## name mixer price spirit
## <chr> <chr> <dbl> <chr>
## 1 g&t "tonic" 8 "gin"
## 2 mimosa "orange juice" 6 "sparkling wine"
## 3 rum and coke "coke" 7 "rum"
## 4 calimocho "coke" 5 "red wine"
## 5 polar "seltzer" 5 ""
## 6 IPA "" 5 "beer"
## 7 dark and stormy "" 8 "rum, ginger beer"
mixer and spirita contains mixer prices and b spirit pricesa = tibble(name=c("g&t","mimosa","rum and coke","calimocho","polar"),
mixer=c("tonic","orange juice", "coke", "coke","seltzer")) %>%
mutate(price=c(8, 6, 7, 5, 5))
a
## # A tibble: 5 x 3
## name mixer price
## <chr> <chr> <dbl>
## 1 g&t tonic 8
## 2 mimosa orange juice 6
## 3 rum and coke coke 7
## 4 calimocho coke 5
## 5 polar seltzer 5
b = tibble(name=c("g&t","mimosa","rum and coke","calimocho","IPA"),
spirit=c("gin","sparkling wine", "rum", "red wine","beer")) %>%
mutate(price=c(8, 6, 7, 5, 5))
b
## # A tibble: 5 x 3
## name spirit price
## <chr> <chr> <dbl>
## 1 g&t gin 8
## 2 mimosa sparkling wine 6
## 3 rum and coke rum 7
## 4 calimocho red wine 5
## 5 IPA beer 5
full_joining by namefull_join(a, b, by='name')
## # A tibble: 6 x 5
## name mixer price.x spirit price.y
## <chr> <chr> <dbl> <chr> <dbl>
## 1 g&t tonic 8 gin 8
## 2 mimosa orange juice 6 sparkling wine 6
## 3 rum and coke coke 7 rum 7
## 4 calimocho coke 5 red wine 5
## 5 polar seltzer 5 <NA> NA
## 6 IPA <NA> NA beer 5
mutate instruction isn’t satisfactory because the sum of an NA + a number is equal to NA:full_join(a, b, by='name') %>% mutate(total=price.x+price.y)
## # A tibble: 6 x 6
## name mixer price.x spirit price.y total
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 g&t tonic 8 gin 8 16
## 2 mimosa orange juice 6 sparkling wine 6 12
## 3 rum and coke coke 7 rum 7 14
## 4 calimocho coke 5 red wine 5 10
## 5 polar seltzer 5 <NA> NA NA
## 6 IPA <NA> NA beer 5 NA
coalesce to replace the NAs by zeros in price.x and price.y, and then adding:full_join(a, b, by='name') %>%
mutate(price.x=coalesce(price.x,0),
price.y=coalesce(price.y,0),
total=price.x+price.y) %>%
select(name, spirit, mixer, total)
## # A tibble: 6 x 4
## name spirit mixer total
## <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic 16
## 2 mimosa sparkling wine orange juice 12
## 3 rum and coke rum coke 14
## 4 calimocho red wine coke 10
## 5 polar <NA> seltzer 5
## 6 IPA beer <NA> 5
join3 <- full_join(a, b, by='name') %>%
mutate(price.x = replace_na(price.x, 0),
price.y = replace_na(price.y, 0),
total=price.x+price.y) %>%
select(name, spirit, mixer, total)
join3
## # A tibble: 6 x 4
## name spirit mixer total
## <chr> <chr> <chr> <dbl>
## 1 g&t gin tonic 16
## 2 mimosa sparkling wine orange juice 12
## 3 rum and coke rum coke 14
## 4 calimocho red wine coke 10
## 5 polar <NA> seltzer 5
## 6 IPA beer <NA> 5
In general, the way that R deals with NAs can be a little frustrating, as these examples show.
We can use unite() to create a new column that merges the information in spirit and mixer
join3 %>% unite(drink, spirit, mixer, sep=' & ') #`drink` is the new variable to be created from `spirit` and `mixer`
## # A tibble: 6 x 3
## name drink total
## <chr> <chr> <dbl>
## 1 g&t gin & tonic 16
## 2 mimosa sparkling wine & orange juice 12
## 3 rum and coke rum & coke 14
## 4 calimocho red wine & coke 10
## 5 polar NA & seltzer 5
## 6 IPA beer & NA 5
coalesce() before unite()trimws() gets rid of any empty spaces at the start and end of the drink variablejoin4 = join3 %>%
mutate(spirit=coalesce(spirit, "nothing"),
mixer=coalesce(mixer, "nothing")) %>%
unite(drink, spirit, mixer, sep=' & ') %>%
mutate(drink=trimws(drink))
join4
## # A tibble: 6 x 3
## name drink total
## <chr> <chr> <dbl>
## 1 g&t gin & tonic 16
## 2 mimosa sparkling wine & orange juice 12
## 3 rum and coke rum & coke 14
## 4 calimocho red wine & coke 10
## 5 polar nothing & seltzer 5
## 6 IPA beer & nothing 5
gsub() to substitute "& nothing" and "nothing &" with blank or empty space|drink into namejoin5 = join4 %>%
mutate(drink1=gsub("& nothing|nothing &","",drink)) %>%
mutate(name=trimws(drink1)) %>%
select(drink, drink1, name, total)
join5
## # A tibble: 6 x 4
## drink drink1 name total
## <chr> <chr> <chr> <dbl>
## 1 gin & tonic "gin & tonic" gin & tonic 16
## 2 sparkling wine & orang~ "sparkling wine & orange~ sparkling wine & oran~ 12
## 3 rum & coke "rum & coke" rum & coke 14
## 4 red wine & coke "red wine & coke" red wine & coke 10
## 5 nothing & seltzer " seltzer" seltzer 5
## 6 beer & nothing "beer " beer 5