library(tidyverse)
library(ggrepel)
library(dslabs)
ds_theme_set()
The information we need for a given analysis may not be in just one table. Here we use a simple example to illustrate the general challenge of combining tables. Suppose we want to explore the relationship between population size for US states, which we have in this table using the murders
dataset from the dslabs
package:
data(murders)
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
and electoral votes (also from the dslabs
package), which we have in this one:
data(polls_us_election_2016)
head(results_us_election_2016)
## state electoral_votes clinton trump others
## 1 California 55 61.7 31.6 6.7
## 2 Texas 38 43.2 52.2 4.5
## 3 Florida 29 47.8 49.0 3.2
## 4 New York 29 59.0 36.5 4.5
## 5 Illinois 20 55.8 38.8 5.4
## 6 Pennsylvania 20 47.9 48.6 3.6
Notice that just joining these two tables together will not work since the order of the states is not quite the same:
identical(results_us_election_2016$state, murders$state)
## [1] FALSE
The join functions, described below, are designed to handle this challenge.
01/21/2022
The join
functions in the dplyr
package, which are based on SQL joins, make sure that the tables are combined so that matching rows are together.
A left join in R is a merge operation between two data frames where the merge returns all of the rows from one table (the left side) and any matching rows from the second table. A left join in R will NOT return values of the second table which do not already exist in the first table.
The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Note what happens if we join the two tables above by state
using left_join
:
tab <- left_join(murders, results_us_election_2016, by = "state")
tab %>% select(state, population, electoral_votes) %>%
head()
## state population electoral_votes
## 1 Alabama 4779736 9
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 6
## 5 California 37253956 55
## 6 Colorado 5029196 9
The data has been successfully joined and we can now, for example, make a plot to explore the relationship between population and electoral votes:
tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) +
geom_point() +
geom_text_repel() +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2") +
geom_smooth(method = "lm", se = FALSE) +
xlab("Population (in millions)") +
ylab("Electoral Votes")
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
We see the relationship is close to linear with about 2 electoral votes for every million persons, but with smaller states getting a higher ratio. In practice, it is not always the case that each row in one table has a matching row in the other. For this reason we have several different ways to join. To illustrate this challenge, take subsets of the matrices above:
tab1 <- slice(murders, 1:6) %>%
select(state, population)
tab1
## state population
## 1 Alabama 4779736
## 2 Alaska 710231
## 3 Arizona 6392017
## 4 Arkansas 2915918
## 5 California 37253956
## 6 Colorado 5029196
so that we no longer have the same states in the two tables:
tab2<- slice(results_us_election_2016, c(1:3), 5, 14, 44) %>%
select(state, electoral_votes)
tab2
## state electoral_votes
## 1 California 55
## 2 Texas 38
## 3 Florida 29
## 4 Illinois 20
## 5 Arizona 11
## 6 Alaska 3
We will use these two tables as examples.
Suppose we want a table like tab1
but adding electoral votes to whatever states we have available. For this we use left join with tab1
as the first argument.
left_join(tab1, tab2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 NA
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
Note that NA
s are added to the three states not appearing in tab2
. Also note that this function, as well as all the other joins, can receive the first arguments through the pipe:
tab1 %>% left_join(tab2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 NA
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
If instead of a table like tab1
we want one like tab2
we can use right_join
:
tab1 %>% right_join(tab2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alaska 710231 3
## 2 Arizona 6392017 11
## 3 California 37253956 55
## 4 Texas NA 38
## 5 Florida NA 29
## 6 Illinois NA 20
Notice that now the NAs are in the column coming from tab1
.
If we want to keep only the rows that have information in both tables we use inner join. You can think of this an intersection:
inner_join(tab1, tab2) # alaska, arizona , california
## Joining, by = "state"
## state population electoral_votes
## 1 Alaska 710231 3
## 2 Arizona 6392017 11
## 3 California 37253956 55
And if we want to keep all the rows, and fill the missing parts with NAs, we can use a full join. You can think of this as a union:
full_join(tab1, tab2)
## Joining, by = "state"
## state population electoral_votes
## 1 Alabama 4779736 NA
## 2 Alaska 710231 3
## 3 Arizona 6392017 11
## 4 Arkansas 2915918 NA
## 5 California 37253956 55
## 6 Colorado 5029196 NA
## 7 Texas NA 38
## 8 Florida NA 29
## 9 Illinois NA 20
The semi_join
let’s us keep the part of the first table for which we have information in the second. It does not add the columns of the second:
semi_join(tab1, tab2)
## Joining, by = "state"
## state population
## 1 Alaska 710231
## 2 Arizona 6392017
## 3 California 37253956
The function anti_join
is the opposite of semi_join
. It keeps the elements of the first table for which there is no information in the second:
anti_join(tab1, tab2)
## Joining, by = "state"
## state population
## 1 Alabama 4779736
## 2 Arkansas 2915918
## 3 Colorado 5029196
Although we have yet to use it in this course, another common way in which datasets are combined is by binding them. Unlike the join function, the binding functions do no try to match by a variable but rather just combine datasets. If the datasets don’t match by the appropriate dimensions one obtains an error.
The dplyr
function bind_cols binds two objects by making them columns in a tibble. For example, if we quickly want to make a data frame consisting of numbers we can use.
bind_cols(a = 1:3, b = 4:6)
## # A tibble: 3 x 2
## a b
## <int> <int>
## 1 1 4
## 2 2 5
## 3 3 6
This function requires that we assign names to the columns. Here we chose a
and b
. Note there is an R-base function cbind
that performs the same function but creates objects other than tibbles.
bind_cols
can also bind data frames. For example, here we break up the tab
data frame and then bind them back together:
tab1 <- tab[, 1:3]
tab2 <- tab[, 4:6]
tab3 <- tab[, 7:9]
new_tab <- bind_cols(tab1, tab2, tab3)
head(new_tab)
## state abb region population total electoral_votes clinton trump others
## 1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
## 2 Alaska AK West 710231 19 3 36.6 51.3 12.2
## 3 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
## 4 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
## 5 California CA West 37253956 1257 55 61.7 31.6 6.7
## 6 Colorado CO West 5029196 65 9 48.2 43.3 8.6
The bind_rows
is similar but binds rows instead of columns.
tab1 <- tab[1:2,]
tab2 <- tab[3:4,]
bind_rows(tab1, tab2)
## state abb region population total electoral_votes clinton trump others
## 1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
## 2 Alaska AK West 710231 19 3 36.6 51.3 12.2
## 3 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
## 4 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
This is based on an R-base function rbind
.
Another set of commands useful for combing are the set operators. When applied to vectors, these behave as their names suggest. However, if the tidyverse
, or more specifically, dplyr
is loaded, these functions can be used on data frames as opposed to just on vectors.
You can take intersections of vectors:
intersect(1:10, 6:15)
## [1] 6 7 8 9 10
intersect(c("a","b","c"), c("b","c","d"))
## [1] "b" "c"
But with dplyr
loaded we can also do this for tables having the same column names:
tab1 <- tab[1:5,]
tab2 <- tab[3:7,]
intersect(tab1, tab2)
## state abb region population total electoral_votes clinton trump others
## 1 Arizona AZ West 6392017 232 11 45.1 48.7 6.2
## 2 Arkansas AR South 2915918 93 6 33.7 60.6 5.8
## 3 California CA West 37253956 1257 55 61.7 31.6 6.7
Similarly union takes the union:
union(1:10, 6:15)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
union(c("a","b","c"), c("b","c","d"))
## [1] "a" "b" "c" "d"
But with dplyr
loaded we can also do this for tables having the same column names:
tab1 <- tab[1:5,]
tab2 <- tab[3:7,]
union(tab1, tab2)
## state abb region population total electoral_votes clinton trump
## 1 Alabama AL South 4779736 135 9 34.4 62.1
## 2 Alaska AK West 710231 19 3 36.6 51.3
## 3 Arizona AZ West 6392017 232 11 45.1 48.7
## 4 Arkansas AR South 2915918 93 6 33.7 60.6
## 5 California CA West 37253956 1257 55 61.7 31.6
## 6 Colorado CO West 5029196 65 9 48.2 43.3
## 7 Connecticut CT Northeast 3574097 97 7 54.6 40.9
## others
## 1 3.6
## 2 12.2
## 3 6.2
## 4 5.8
## 5 6.7
## 6 8.6
## 7 4.5
The set difference between a first and second argument can be obtained with setdiff
. Not unlike instersect
and union
, this function is not symmetric:
setdiff(1:10, 6:15)
## [1] 1 2 3 4 5
setdiff(6:15, 1:10)
## [1] 11 12 13 14 15
#setdiff(1:5, 4:8)
#setdiff(4:8, 1:5)
As with the others above, we can apply it to data frames:
tab1 <- tab[1:5,]
tab2 <- tab[3:7,]
setdiff(tab1, tab2)
## state abb region population total electoral_votes clinton trump others
## 1 Alabama AL South 4779736 135 9 34.4 62.1 3.6
## 2 Alaska AK West 710231 19 3 36.6 51.3 12.2
setequal
Finally, the function set_equal
tells us if two sets are the same, regardless of order. So
setequal(1:5, 1:6)
## [1] FALSE
but
setequal(1:5, 5:1)
## [1] TRUE
It also works when applied to data frames that are not equal regardless of order:
setequal(tab1, tab2)
## [1] FALSE