Greetings , hope you will enjoy SQL JOINS
coming from an
R afficionado
who has dealt mainly with
R joins
. These notes are based on how much i have
understood and spent some time looking for pictures
from
the internet to aid in the presentation. > The presentation assumes
you are already familiar with a bit of SQL
Mutating joins allow you to take two different tables and combine the variables from both tables. This requires that each table have a column relating the tables to one another (i.e. a unique identifier). This unique identifier is used to match observations between the tables.
However, when combining tables, there are a number of different ways in which the tables can be joined:
x
and y
x
y
x
or y
Let’s break down exactly what we mean by this using just a small toy
example from the artists
and albums
tables
from a company
database. Here you see three rows from the
artists
table and four rows from the albums
table.
RSTUDIO
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)
fake datasets
to assist in explaining and these are
join_df1.csv
and join_df2.csv
datasets# read in example csv files (join_df1 and join_df2)
<- read.csv("join_df1.csv") |>
join_df1 rename(A=`ï..A`)
<- read.csv("join_df2.csv")|>
join_df2 rename(A=`ï..A`)
<-dbConnect(RSQLite::SQLite(), ":memory:")
joincopy_to(join,join_df1)
copy_to(join,join_df2)
dbGetQuery(join,"SELECT * FROM join_df1")
this data has 8 rows and 3 columns
dbGetQuery(join,"SELECT * FROM join_df2")
this has 7 rows and 2 columns
A
as our
unique identifier
generally the following step show how a join is done in SQL
SELECT
t1.comon_column1,
t1.common_column2,
…FROM
table1 AS
t1<type> JOIN
table2 AS
t2ON
t1.common_unique_column =
t2.common_unique_column;
When talking about inner joins, we are only going to keep an
observation if it is found in all of the tables we’re combining. Here,
we’re combining the tables based on the ArtistId
column. In
our dummy example, there are only two artists that are found in
both tables. These are highlighted in green and will be the
rows used to join the two tables. Then, once the inner join happens,
only these artists’ data will be included after the inner join.
when doing an inner join
, data from any observation
found in all the tables being joined are included in the output. Here,
ArtistId
s “1” and “2” are in both the artists
and albums
tables. Thus, those will be the only
ArtistId
s in the output from the inner join.
And, since it’s a mutating join, our new table will have information
from both tables! We now have ArtistId
, Name
,
AlbumId
, and Title
in a single table!
We’ve joined the two tables, based on the column
ArtistId
!
ACTION
dbGetQuery(join,"SELECT *
FROM join_df1
INNER JOIN join_df2
USING(A)")
we note from the above that only
the colours
that arecommon to both tables
are returned. however instead of usingUSING
we can useON table1.columnname=table2.columnname
wherecolumnname
is the column that tables to be matched by and in our example this is columnA
. when using this method it is important to give each table anALIAS
which is ashorthand
for our tables
dbGetQuery(join,"SELECT *
FROM join_df1 AS j1
INNER JOIN join_df2 AS j2
ON j1.A=j2.A")
looking at the table above ,we note that
column A
has been repeated twice , thus we need to onlySELECT
thecolumn A
From only one of our tables hence we specify the repeated columns in theSELECT
statement using thetable.columnname
method . So in the example aboveSELECT j1.A,B,C,D
specifies that we want we want columnA
to come from tablej1
(j1.A)
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
INNER JOIN join_df2 AS j2
ON j1.A=j2.A")
great :)! , generally we can apply the same scenario to all kinds of joins
For a left join, all rows in the first table specified will be included in the output. Any row in the second table that is not in the first table will not be included.
In our toy example this means that ArtistID
s 1, 2, and 3
will be included in the output; however, ArtistID
4 will
not.
Thus, our output will again include all the columns from both tables
combined into a single table; however, for ArtistId
3,
there will be NAs for AlbumId
and Title
. NAs
will be filled in for any observations in the first table specified that
are missing in the second table.
Let us see it in ACTION
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
LEFT JOIN join_df2 AS j2
ON j1.A=j2.A")
great , we notice the
NA
values found in the last column
Right Join is similar to what we just discussed; however, in the output from a right join, all rows in the final table specified are included in the output. NAs will be included for any observations found in the last specified table but not in the other tables.
In our toy example, that means, information about
ArtistID
s 1, 2, and 4 will be included.
Again, in our toy example, we see that right join
combines the information across tables; however, in this case,
ArtistId
4 is included, but Name
is an NA, as
this information was not in the artists
table for this
artist.
Let us see a
RIGHT JOIN
in action
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
RIGHT JOIN join_df2 AS j2
ON j1.A=j2.A")
great , notice now where the
NA
is displayed
Finally, a full join will take every observation from every table and include it in the output.
Thus, in our toy example, this join produces five rows, including all the observations from either table. NAs are filled in when data are missing for an observation.
let us see a
FULL JOIN
in ACTION
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
FULL JOIN join_df2 AS j2
ON j1.A=j2.A")