Email             :
Instagram     : https://www.instagram.com/dsciencelabs
RPubs            : https://rpubs.com/dsciencelabs/
Github           : https://github.com/dsciencelabs/
Telegram       : @dsciencelabs
Department  : Business Statistics
Address         : ARA Center, Matana University Tower
                         Jl. CBD Barat Kav, RT.1, Curug Sangereng, Kelapa Dua, Tangerang, Banten 15810.




1 What is a SQL join?

A SQL join is a Structured Query Language (SQL) instruction to combine data from two sets of data (i.e. two tables). Before we dive into the details of a SQL join, let’s briefly discuss what SQL is, and why someone would want to perform a SQL join.

SQL is a special-purpose programming language designed for managing information in a relational database management system (RDBMS). The word relational here is key; it specifies that the database management system is organized in such a way that there are clear relations defined between different sets of data.

Typically, you need to extract, transform, and load data into your RDBMS before you’re able to manage it using SQL, which you can accomplish by using a tool like Stitch.


2 Relational Database

Imagine you’re running a store and would like to record information about your customers and their orders. By using a relational database, you can save this information as two tables that represent two distinct entities: customers and orders .

2.1 Table Customers

Table 1, informs about each customer is stored in its own row, with columns specifying different bits of information, including their first name, last name, and email address. Additionally, we associate a unique customer number, or primary key, with each customer record.

2.2 Table Orders

Again, Table 2 are contains information about a specific order. Each order has its own unique identification key order_id for this table – assigned to it as well.

2.3 Relational Model

You’ve probably noticed that these two examples share similar information. You can see these simple relations diagrammed below:

Note that the orders table contains two keys: one for the order and one for the customer who placed that order. In scenarios when there are multiple keys in a table, the key that refers to the entity being described in that table is called the primary key (PK) and other key is called a foreign key (FK).

In our example, order_id is a primary key in the orders table, while customer_id is both a primary key in the customers table and a foreign key in the orders table. Primary and foreign keys are essential to describing relations between the tables, and in performing SQL joins.


2.4 Factory Database

To make you more convenient about all the data tables that we will use in this section. Here, I summarize the following SQL relational for database factory_db:

Note: Don’t forget to consider the data structure of your database (all table)


3 Basic SQL Join Types

There are four basic types of SQL joins: inner, left, right, and full. The easiest and most intuitive way to explain the difference between these four types is by using a Venn diagram, which shows all possible logical relations between data sets.

Again, it’s important to stress that before you can begin using any join type, you’ll need to extract the data and load it into an RDBMS like Amazon Redshift, where you can query tables from multiple sources. You build that process manually, or you can use an ETL service like Stitch, which automates that process for you.

3.1 Connect to MySQL

Reading data from MySQL into R workspace, it requires two R libraries, RMySQL and DBI. The connection data should not be embedded in analysis code. Separate the connection code in another script. The script should set up the connection and save it into the workspace.

The saved connection is accessible by its name in the analysis code. In the dbConnect function, you need to replace dbname, username, pwd, dbserver and port with the actual values of your remote database.

# set up the connection and save it into the workspace
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
library(RMySQL)
library(DBI)
bakti <- dbConnect(RMySQL::MySQL(),
                  dbname='factory_db',
                  username='root',
                  password='', 
                  host='localhost',
                  port=3306)
knitr::opts_chunk$set(connection = "bakti")   # set up the connection 

After set up the connection and save it into the workspace. Next, we can run SQL in a code chunk of type sql. By setting the connection in the code chuck and adding the option output.var, the resulting table from the SQL is written into a variable in R.

'''{sql connection=bakti, output.var="report_model_by_make"} 
      Your SQL code Here
'''

3.2 Inner Join

Let’s say we wanted to get a list of those customers who placed an order and the details of the order they placed. This would be a perfect fit for an inner join, since an inner join returns records at the intersection of the two tables.

SELECT OrderID, CustomerName
  FROM Orders O
    INNER JOIN Customers C
      ON O.CustomerID = C.CustomerID
library(DT)
datatable(Inner1, 
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;', 
            htmltools::em('Table 3: SQL Inner Join Two Tables.')))

The following SQL statement selects all orders with customer and shipper information:

SELECT * 
  FROM ((Orders O
    INNER JOIN Customers C
      ON O.CustomerID = C.CustomerID)
    INNER JOIN Shippers S 
      ON O.ShipperID = S.ShipperID)
datatable(Inner2, 
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;', 
            htmltools::em('Table 4: SQL Inner Join Three Tables.')),
          extensions = 'FixedColumns',
          options = list(scrollX = TRUE, fixedColumns = TRUE)
          )

3.3 Left Join

If we wanted to simply append information about orders to our customers table, regardless of whether a customer placed an order or not, we would use a left join. A left join returns all records from table A and any matching records from table B. The result is NULL from the right side, if there is no match.

SELECT CustomerName, OrderID
  FROM Customers C
    LEFT JOIN Orders O
      ON C.CustomerID = O.CustomerID
        ORDER BY C.CustomerName
datatable(Left, 
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;', 
            htmltools::em('Table 5: SQL Left Join Two Tables.')))

3.4 Right Join

The following SQL statement will return all employees, and any orders they might have placed. The result is NULL from the left side, when there is no match.

SELECT OrderID, LastName, FirstName
  FROM Orders O
    RIGHT JOIN Employees E
      ON O.EmployeeID = E.EmployeeID
        ORDER BY O.OrderID
datatable(Right, 
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;', 
            htmltools::em('Table 6: SQL Right Join Two Tables.')))

3.5 Full Join

The FULL OUTER JOIN keyword returns all records when there is a match in left (table1) or right (table2) table records. FULL OUTER JOIN, FULL JOIN, and JOIN (MariaDB) are the same. The following SQL statement selects all customers, and all orders:

SELECT CustomerName, OrderID
  FROM Customers C
    JOIN Orders O
      ON C.CustomerID=O.CustomerID
        ORDER BY C.CustomerName
datatable(Full, 
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;', 
            htmltools::em('Table 7: SQL Full Join Two Tables.')))

3.6 Self JOIN

A self JOIN is a regular join, but the table is joined with itself. The following SQL statement matches customers that are from the same city:

SELECT A.CustomerName AS CustomerName1, 
       B.CustomerName AS CustomerName2, 
       A.City
  FROM Customers A, 
       Customers B
    WHERE A.CustomerID <> B.CustomerID
      AND A.City = B.City
        ORDER BY A.City
datatable(Self, 
          caption = htmltools::tags$caption(
            style = 'caption-side: bottom; text-align: center;', 
            htmltools::em('Table 8: SQL Self Join Two Tables.')))

After finishing the work with the database, close the connection.

DBI::dbDisconnect(bakti)
## [1] TRUE

4 Excercise

  1. Apply Left join and Right join to returns all records from table Orders and any matching records from table Suppliers.

  2. Choose the correct JOIN clause to select all records from the two tables (Orders and Suppliers) where there is a match in both tables.

  3. Choose the correct JOIN clause to select all the records from the Suppliers table plus all the matches in the Orders table.