0.0.1 Introduction

Greetings , hope you will enjoy SQL JOINS coming from an R afficionado who has dealt mainly with R joins . These notes are based on how much i have understood and spent some time looking for pictures from the internet to aid in the presentation. > The presentation assumes you are already familiar with a bit of SQL

0.0.2 Mutating Joins

Mutating joins allow you to take two different tables and combine the variables from both tables. This requires that each table have a column relating the tables to one another (i.e. a unique identifier). This unique identifier is used to match observations between the tables.

However, when combining tables, there are a number of different ways in which the tables can be joined:

  • Inner Join - only keep observations found in both x and y
  • Left Join - keep all observations in x
  • Right Join - keep all observations in y
  • Full Join - keep any observations in x or y

Let’s break down exactly what we mean by this using just a small toy example from the artists and albums tables from a company database. Here you see three rows from the artists table and four rows from the albums table.

0.0.3 setup packages

  • the following packages are necessary for the presentation to be possible in RSTUDIO
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)

0.0.4 setting up the data

  • to aid with illustrations , I have created some fake datasets to assist in explaining and these are join_df1.csv and join_df2.csv datasets
  • read in CSV files and store them as databases in R
# read in example csv files (join_df1 and join_df2)
join_df1 <- read.csv("join_df1.csv") |> 
  rename(A=`ï..A`)
join_df2 <- read.csv("join_df2.csv")|> 
  rename(A=`ï..A`)

join<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(join,join_df1)
copy_to(join,join_df2)

0.1 look at the databases

0.1.1 join_df1

dbGetQuery(join,"SELECT * FROM join_df1")

this data has 8 rows and 3 columns

0.1.2 join_df2

dbGetQuery(join,"SELECT * FROM join_df2")

this has 7 rows and 2 columns

  • note that we will use the column A as our unique identifier

0.2 Querying A join

generally the following step show how a join is done in SQL

  • SELECT t1.comon_column1,t1.common_column2,
  • FROM table1 AS t1
  • <type> JOIN table2 AS t2
  • ON t1.common_unique_column = t2.common_unique_column;

0.2.1 Inner Join

When talking about inner joins, we are only going to keep an observation if it is found in all of the tables we’re combining. Here, we’re combining the tables based on the ArtistId column. In our dummy example, there are only two artists that are found in both tables. These are highlighted in green and will be the rows used to join the two tables. Then, once the inner join happens, only these artists’ data will be included after the inner join.

when doing an inner join, data from any observation found in all the tables being joined are included in the output. Here, ArtistIds “1” and “2” are in both the artists and albums tables. Thus, those will be the only ArtistIds in the output from the inner join.

And, since it’s a mutating join, our new table will have information from both tables! We now have ArtistId, Name, AlbumId, and Title in a single table! We’ve joined the two tables, based on the column ArtistId!

0.2.2 See this in ACTION

dbGetQuery(join,"SELECT * 
                 FROM join_df1
                 INNER JOIN join_df2
                 USING(A)")

we note from the above that only the colours that are common to both tables are returned. however instead of using USING we can use ON table1.columnname=table2.columnname where columnname is the column that tables to be matched by and in our example this is column A . when using this method it is important to give each table an ALIAS which is a shorthand for our tables

dbGetQuery(join,"SELECT * 
                 FROM join_df1 AS j1
                 INNER JOIN join_df2 AS j2
                 ON j1.A=j2.A")

looking at the table above ,we note that column A has been repeated twice , thus we need to only SELECT the column A From only one of our tables hence we specify the repeated columns in the SELECT statement using the table.columnname method . So in the example above SELECT j1.A,B,C,D specifies that we want we want column A to come from table j1 (j1.A)

dbGetQuery(join,"SELECT j1.A,B,C,D 
                 FROM join_df1 AS j1
                 INNER JOIN join_df2 AS j2
                 ON j1.A=j2.A")

great :)! , generally we can apply the same scenario to all kinds of joins

0.3 left join in SQL

For a left join, all rows in the first table specified will be included in the output. Any row in the second table that is not in the first table will not be included.

In our toy example this means that ArtistIDs 1, 2, and 3 will be included in the output; however, ArtistID 4 will not.

Thus, our output will again include all the columns from both tables combined into a single table; however, for ArtistId 3, there will be NAs for AlbumId and Title. NAs will be filled in for any observations in the first table specified that are missing in the second table.

Let us see it in ACTION

dbGetQuery(join,"SELECT j1.A,B,C,D 
                 FROM join_df1 AS j1
                 LEFT JOIN join_df2 AS j2
                 ON j1.A=j2.A")

great , we notice the NA values found in the last column

0.3.1 RIGHT JOIN IN SQL

Right Join is similar to what we just discussed; however, in the output from a right join, all rows in the final table specified are included in the output. NAs will be included for any observations found in the last specified table but not in the other tables.

In our toy example, that means, information about ArtistIDs 1, 2, and 4 will be included.

Again, in our toy example, we see that right join combines the information across tables; however, in this case, ArtistId 4 is included, but Name is an NA, as this information was not in the artists table for this artist.

Let us see a RIGHT JOIN in action

dbGetQuery(join,"SELECT j1.A,B,C,D 
                 FROM join_df1 AS j1
                 RIGHT JOIN join_df2 AS j2
                 ON j1.A=j2.A")

great , notice now where the NA is displayed

0.3.2 full join in SQL

Finally, a full join will take every observation from every table and include it in the output.

Thus, in our toy example, this join produces five rows, including all the observations from either table. NAs are filled in when data are missing for an observation.

let us see a FULL JOIN in ACTION

dbGetQuery(join,"SELECT j1.A,B,C,D 
                 FROM join_df1 AS j1
                 FULL JOIN join_df2 AS j2
                 ON j1.A=j2.A")

1 References

  • Tips on Getting Started with databases using R (from Rstudio)
  • Using SQL in Rstudio from Irene Steves - a discussion of the Rubymine IDE and working with SQL in RSTUDIO.
  • DATACAMP (data analyst with SQL)