Albert Y. Kim
Monday 2016/2/22
Imagine we have two data frames x and y:
x1 x2
1 A 1
2 B 2
3 C 3
x1 x3
1 A TRUE
2 B FALSE
3 D TRUE
We want to join them along the categorical variable x1and end up with a
new data frame that has all three variables:
inner_join(x, y, by = "x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
full_join(x, y, by = "x1")
x1 x2 x3
1 A 1 TRUE
2 B 2 FALSE
3 C 3 NA
4 D NA TRUE
dplyr's operations to join data sets are inspired by SQL (Structured Query
Language), which is used to query large databases.NA (R code for missing value) is
inserted.In our example, both data frames x and y the had the “join by” variable
named the same: x1. Hence:
inner_join(x, y, by = "x1")
If the “join by” variable has different names in the data frames. Ex:
a in data frame xb in data frame ythen run: inner_join(x, y, by = c("a"="b")).
Note it is
by = c("a"="b") and notby = c("a"=="b")See by in the “Argument” section of the help file ?join.
?join for a list of the different types of joinsAt the beginning of Lec03.R, I load all necessary packages at the beginning.
This is good practice people know which packages they need installed. For example, I didn't know I needed a particular package before I boarded a 4 hour flight because the package loading was buried in the file.