Greetings , hope you will enjoy SQL JOINS coming from an
R afficionado who has dealt mainly with
R joins . These notes are based on how much i have
understood and spent some time looking for pictures from
the internet to aid in the presentation. > The presentation assumes
you are already familiar with a bit of SQL
Mutating joins allow you to take two different tables and combine the variables from both tables. This requires that each table have a column relating the tables to one another (i.e. a unique identifier). This unique identifier is used to match observations between the tables.
However, when combining tables, there are a number of different ways in which the tables can be joined:
x and yxyx
or yLet’s break down exactly what we mean by this using just a small toy
example from the artists and albums tables
from a company database. Here you see three rows from the
artists table and four rows from the albums
table.
RSTUDIOlibrary(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)fake datasets to assist in explaining and these are
join_df1.csv and join_df2.csv datasets# read in example csv files (join_df1 and join_df2)
join_df1 <- read.csv("join_df1.csv") |>
rename(A=`ï..A`)
join_df2 <- read.csv("join_df2.csv")|>
rename(A=`ï..A`)
join<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(join,join_df1)
copy_to(join,join_df2)dbGetQuery(join,"SELECT * FROM join_df1")this data has 8 rows and 3 columns
dbGetQuery(join,"SELECT * FROM join_df2")this has 7 rows and 2 columns
A as our
unique identifiergenerally the following step show how a join is done in SQL
SELECT
t1.comon_column1,t1.common_column2,…FROM table1 AS t1<type> JOIN table2 AS t2ON t1.common_unique_column =
t2.common_unique_column;When talking about inner joins, we are only going to keep an
observation if it is found in all of the tables we’re combining. Here,
we’re combining the tables based on the ArtistId column. In
our dummy example, there are only two artists that are found in
both tables. These are highlighted in green and will be the
rows used to join the two tables. Then, once the inner join happens,
only these artists’ data will be included after the inner join.
when doing an inner join, data from any observation
found in all the tables being joined are included in the output. Here,
ArtistIds “1” and “2” are in both the artists
and albums tables. Thus, those will be the only
ArtistIds in the output from the inner join.
And, since it’s a mutating join, our new table will have information
from both tables! We now have ArtistId, Name,
AlbumId, and Title in a single table!
We’ve joined the two tables, based on the column
ArtistId!
ACTIONdbGetQuery(join,"SELECT *
FROM join_df1
INNER JOIN join_df2
USING(A)")we note from the above that only
the coloursthat arecommon to both tablesare returned. however instead of usingUSINGwe can useON table1.columnname=table2.columnnamewherecolumnnameis the column that tables to be matched by and in our example this is columnA. when using this method it is important to give each table anALIASwhich is ashorthandfor our tables
dbGetQuery(join,"SELECT *
FROM join_df1 AS j1
INNER JOIN join_df2 AS j2
ON j1.A=j2.A")looking at the table above ,we note that
column Ahas been repeated twice , thus we need to onlySELECTthecolumn AFrom only one of our tables hence we specify the repeated columns in theSELECTstatement using thetable.columnnamemethod . So in the example aboveSELECT j1.A,B,C,Dspecifies that we want we want columnAto come from tablej1(j1.A)
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
INNER JOIN join_df2 AS j2
ON j1.A=j2.A")great :)! , generally we can apply the same scenario to all kinds of joins
For a left join, all rows in the first table specified will be included in the output. Any row in the second table that is not in the first table will not be included.
In our toy example this means that ArtistIDs 1, 2, and 3
will be included in the output; however, ArtistID 4 will
not.
Thus, our output will again include all the columns from both tables
combined into a single table; however, for ArtistId 3,
there will be NAs for AlbumId and Title. NAs
will be filled in for any observations in the first table specified that
are missing in the second table.
Let us see it in ACTION
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
LEFT JOIN join_df2 AS j2
ON j1.A=j2.A")great , we notice the
NAvalues found in the last column
Right Join is similar to what we just discussed; however, in the output from a right join, all rows in the final table specified are included in the output. NAs will be included for any observations found in the last specified table but not in the other tables.
In our toy example, that means, information about
ArtistIDs 1, 2, and 4 will be included.
Again, in our toy example, we see that right join
combines the information across tables; however, in this case,
ArtistId 4 is included, but Name is an NA, as
this information was not in the artists table for this
artist.
Let us see a
RIGHT JOINin action
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
RIGHT JOIN join_df2 AS j2
ON j1.A=j2.A")great , notice now where the
NAis displayed
Finally, a full join will take every observation from every table and include it in the output.
Thus, in our toy example, this join produces five rows, including all the observations from either table. NAs are filled in when data are missing for an observation.
let us see a
FULL JOINin ACTION
dbGetQuery(join,"SELECT j1.A,B,C,D
FROM join_df1 AS j1
FULL JOIN join_df2 AS j2
ON j1.A=j2.A")