This learning post is going to simplify the use-case of basic SQL queries in RStudio using R Markdown from my drafted book on Machine Learning Tools. Do you want to learn more on machine learning tools to perform data analytics? If yes, then this post might be useful for you. I’ll publish an online version of the book once its ready to go by the end of this year, 2021.
I assume that you have installed following machine learning tools on your computer to reproduce this entire work.
The process of retrieving data from a database server is known as querying. Queries allow a database user to understand the story that the data is telling. This post is going to focus on retrieving information from a sample database. This sample database was developed by Data Simulation Research Lab using Microsoft SQL Server 2019.
In order to explore and learn basic SQL queries in Microsoft SQL Server, first we need to have a database. Data Simulation Research lab (DSRLab) developed their own datawarehouse using Microsoft SQL Server 2019 which is publicly open in a GitHub repository. I am going to use this sample database here to explore basic SQL queries. You can find the download link to the database from the GitHub repository at URL https://github.com/NikhilChandraSarkar/DSRLab-Data-Warehouse. On this page, you find the section for the DSRLab full database backups and the file called DSRLabDW_Full.bak. Let’s go ahead and click on that link and download the file to your computer. Now you need to move the downloaded file into the Microsoft SQL Server backup folder on your computer and rename it as DSRLabDW. Then you can login to your Microsoft SQL Server using a Microsoft SQL Server Management Studio tool to restore the sample database on the server.
Now I need to connect my Microsoft SQL Server instance with my RStudio. The following machine learning code in RStudio using R Marckdown shows how to connect to a local Microsoft SQL Server instance:
library(odbc)
## Warning: package 'odbc' was built under R version 4.0.5
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "localhost\\SQLEXPRESS",
Database = "DSRLabDW",
Trusted_Connection = "True")
To enable SQL quries in R Markdown for this learning, I need to use {sql, connection =con} in each chunk instead of {r}.Though SQL language is case and white space insensitive, here I use the standard convention to write all query command keywords in uppercase letters and finish the statement with a semicolon.
Let’s get started writing a SELECT statement to retrieve data from the ‘DSRLabDW’ database. Here I use a SELECT statement to select a column named ‘Crash_Severity’ from a table named ‘crash_location’ from the ‘DSRLabDW’ database. And they display values from every record in that column.
SELECT Crash_Severity
FROM dbo.crash_location;
| Crash_Severity |
|---|
| Hospitalisation |
| Property damage only |
| Property damage only |
| Property damage only |
| Property damage only |
| Medical treatment |
| Medical treatment |
| Medical treatment |
| Medical treatment |
| Medical treatment |
Let’s continue working with ‘crash_location’ table to select multiple columns using following SQL query:
SELECT Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location;
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2001 | Hospitalisation | Head-on |
| 2001 | Property damage only | Angle |
| 2001 | Property damage only | Angle |
| 2005 | Property damage only | Struck by external load |
| 2002 | Property damage only | Hit object |
| 2012 | Medical treatment | Sideswipe |
| 2012 | Medical treatment | Angle |
| 2012 | Medical treatment | Hit object |
| 2012 | Medical treatment | Angle |
| 2012 | Medical treatment | Sideswipe |
Let’s continue working with ‘crash_location’ table to select all columns using * symbol after SELECT keyword.
SELECT *
FROM dbo.crash_location;
| Crash_Ref_Number | Crash_Severity | Crash_Year | Crash_Month | Crash_Day_Of_Week | Crash_Hour | Crash_Nature | Crash_Type | Crash_Longitude_GDA94 | Crash_Latitude_GDA94 | Crash_Street | Crash_Street_Intersecting | Loc_Post_Code | Crash_DCA_Code | Crash_DCA_Description | Count_Casualty_Fatality | Count_Casualty_Hospitalised | Count_Casualty_MedicallyTreated | Count_Casualty_MinorInjury | Loc_Suburb | Loc_Local_Government_Area | Loc_Police_Division | Loc_Police_District | Loc_Police_Region | Loc_Queensland_Transport_Region | Loc_Main_Roads_Region | Loc_ABS_Statistical_Area_2 | Loc_ABS_Statistical_Area_3 | Loc_ABS_Statistical_Area_4 | Loc_ABS_Remoteness | Loc_State_Electorate | Loc_Federal_Electorate | Crash_Controlling_Authority | Crash_Roadway_Feature | Crash_Traffic_Control | Crash_Speed_Limit | Crash_Road_Surface_Condition | Crash_Atmospheric_Condition | Crash_Lighting_Condition | Crash_Road_Horiz_Align | Crash_Road_Vert_Align | Crash_DCA_Group_Description | DCA_Key_Approach_Dir | Count_Casualty_Total | Count_Unit_Car | Count_Unit_Motorcycle_Moped | Count_Unit_Truck | Count_Unit_Bus | Count_Unit_Bicycle | Count_Unit_Pedestrian | Count_Unit_Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hospitalisation | 2001 | January | Monday | 6 | Head-on | Multi-Vehicle | 153.0522011 | -27.58637006 | Gretel St | NA | 4109 | 201 | Veh’S Opposite Approach: Head On | 0 | 1 | 0 | 0 | Sunnybank Hills | Brisbane City | Acacia Ridge | South Brisbane | Brisbane | SEQ South | Metropolitan | Sunnybank Hills | Sunnybank | Brisbane - South | Major Cities | Stretton | Moreton | Locally-controlled | No Roadway Feature | No traffic control | 0 - 50 km/h | Sealed - Dry | Clear | Daylight | Curved - view obscured | Grade | Head-on | S | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 10 | Property damage only | 2001 | January | Tuesday | 15 | Angle | Multi-Vehicle | 153.0207807 | -27.57233755 | Desgrand St | Kerry Rd | 4108 | 202 | Veh’S Opposite Approach: Thru-Right | 0 | 0 | 0 | 0 | Archerfield | Brisbane City | Acacia Ridge | South Brisbane | Brisbane | SEQ South | Metropolitan | Rocklea - Acacia Ridge | Rocklea - Acacia Ridge | Brisbane - South | Major Cities | Algester | Moreton | Locally-controlled | Intersection - T-Junction | No traffic control | 60 km/h | Sealed - Dry | Clear | Daylight | Straight | Level | Opposing vehicles turning | W | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | Property damage only | 2001 | June | Tuesday | 17 | Angle | Multi-Vehicle | 153.0830031 | -27.60004316 | Beenleigh Rd | Warrigal Rd | 4113 | 202 | Veh’S Opposite Approach: Thru-Right | 0 | 0 | 0 | 0 | Runcorn | Brisbane City | Upper Mount Gravatt | South Brisbane | Brisbane | SEQ South | Metropolitan | Runcorn | Sunnybank | Brisbane - South | Major Cities | Stretton | Moreton | Locally-controlled | Intersection - T-Junction | Operating traffic lights | 60 km/h | Sealed - Dry | Clear | Daylight | Straight | Level | Opposing vehicles turning | W | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1000 | Property damage only | 2005 | August | Monday | 10 | Struck by external load | Other | 153.0700223 | -27.61138557 | Compton Rd | NA | 4113 | 610 | Pass & Misc: Load Hit Vehicle | 0 | 0 | 0 | 0 | Runcorn | Brisbane City | Acacia Ridge | South Brisbane | Brisbane | SEQ South | Metropolitan | Runcorn | Sunnybank | Brisbane - South | Major Cities | Stretton | Moreton | Locally-controlled | No Roadway Feature | No traffic control | 70 km/h | Sealed - Dry | Clear | Daylight | Straight | Level | Other | W | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 10000 | Property damage only | 2002 | December | Tuesday | 5 | Hit object | Single Vehicle | 153.0009104 | -27.45869677 | Latrobe Tce | NA | 4064 | 803 | Off Path-Curve: Off Cway Rt Bend Hit Obj | 0 | 0 | 0 | 0 | Paddington | Brisbane City | Brisbane City | North Brisbane | Brisbane | SEQ North | Metropolitan | Paddington - Milton | Brisbane Inner - West | Brisbane Inner City | Major Cities | Cooper | Brisbane | Locally-controlled | No Roadway Feature | No traffic control | 0 - 50 km/h | Sealed - Wet | Raining | Daylight | Curved - view open | Grade | Off carriageway on curve hit object | E | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100000 | Medical treatment | 2012 | August | Tuesday | 22 | Sideswipe | Multi-Vehicle | 153.0712847 | -27.693116 | Green Rd | NA | 4125 | 201 | Veh’S Opposite Approach: Head On | 0 | 0 | 2 | 1 | Park Ridge | Logan City | Crestmead | Logan | South Eastern | SEQ South | South Coast | Crestmead | Browns Plains | Logan - Beaudesert | Major Cities | Logan | Rankin | Locally-controlled | No Roadway Feature | No traffic control | 70 km/h | Sealed - Dry | Clear | Darkness - Not lighted | Straight | Grade | Head-on | W | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100001 | Medical treatment | 2012 | August | Wednesday | 7 | Angle | Multi-Vehicle | 153.081838 | -27.69754 | Bumstead Rd | Julie St | 4132 | 105 | Veh’S Adjacent Approach: Right-Right | 0 | 0 | 1 | 0 | Crestmead | Logan City | Crestmead | Logan | South Eastern | SEQ South | South Coast | Crestmead | Browns Plains | Logan - Beaudesert | Major Cities | Logan | Forde | Locally-controlled | Intersection - T-Junction | Give way sign | 80 - 90 km/h | Sealed - Dry | Clear | Daylight | Straight | Grade | Intersection from adjacent approaches | S | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100002 | Medical treatment | 2012 | August | Saturday | 11 | Hit object | Single Vehicle | 153.0428417 | -27.65746035 | Fern St | NA | 4118 | 804 | Off Path-Curve: Off Cway Lt Bend Hit Obj | 0 | 0 | 1 | 0 | Browns Plains | Logan City | Browns Plains | Logan | South Eastern | SEQ South | South Coast | Browns Plains | Browns Plains | Logan - Beaudesert | Major Cities | Algester | Rankin | Locally-controlled | Intersection - T-Junction | No traffic control | 0 - 50 km/h | Sealed - Dry | Clear | Daylight | Straight | Grade | Off carriageway on curve hit object | E | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100003 | Medical treatment | 2012 | August | Friday | 18 | Angle | Multi-Vehicle | 153.093004 | -27.681338 | Browns Plains Rd | Second Ave | 4132 | 202 | Veh’S Opposite Approach: Thru-Right | 0 | 0 | 2 | 1 | Crestmead | Logan City | Crestmead | Logan | South Eastern | SEQ South | South Coast | Marsden | Browns Plains | Logan - Beaudesert | Major Cities | Woodridge | Rankin | Locally-controlled | Intersection - Cross | Operating traffic lights | 70 km/h | Sealed - Dry | Clear | Dawn/Dusk | Straight | Level | Opposing vehicles turning | E | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100004 | Medical treatment | 2012 | August | Friday | 8 | Sideswipe | Multi-Vehicle | 153.0528835 | -27.66274564 | Waller Rd | NA | 4118 | 307 | Veh’S Same Direction: Lane Change Left | 0 | 0 | 1 | 0 | Browns Plains | Logan City | Browns Plains | Logan | South Eastern | SEQ South | South Coast | Browns Plains | Browns Plains | Logan - Beaudesert | Major Cities | Woodridge | Rankin | Locally-controlled | No Roadway Feature | No traffic control | 60 km/h | Sealed - Dry | Clear | Daylight | Straight | Level | Lane changes | N | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
Use TOP function to limit the records from a column
SELECT TOP(8) Crash_Severity
FROM dbo.crash_location;
| Crash_Severity |
|---|
| Hospitalisation |
| Property damage only |
| Property damage only |
| Property damage only |
| Property damage only |
| Medical treatment |
| Medical treatment |
| Medical treatment |
Use TOP function to limit the records from multiple columns
SELECT TOP (5) Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location;
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2001 | Hospitalisation | Head-on |
| 2001 | Property damage only | Angle |
| 2001 | Property damage only | Angle |
| 2005 | Property damage only | Struck by external load |
| 2002 | Property damage only | Hit object |
Let’s use a WHERE clause to filter the data from the same table. Here I specify the condition after WHERE keyword that I want to filter by. The filtering condition in the WHERE clause is called a predicate. It usually involve a comparison operator such as this equal sign.
SELECT Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature = 'Head-on';
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2001 | Hospitalisation | Head-on |
| 2012 | Hospitalisation | Head-on |
| 2013 | Hospitalisation | Head-on |
| 2013 | Hospitalisation | Head-on |
| 2014 | Hospitalisation | Head-on |
| 2014 | Minor injury | Head-on |
| 2014 | Hospitalisation | Head-on |
| 2014 | Hospitalisation | Head-on |
| 2014 | Hospitalisation | Head-on |
| 2015 | Medical treatment | Head-on |
Add OR with WHERE clause
SELECT Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature = 'Head-on'
OR Crash_Nature = 'Hit object';
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2001 | Hospitalisation | Head-on |
| 2002 | Property damage only | Hit object |
| 2012 | Medical treatment | Hit object |
| 2012 | Hospitalisation | Hit object |
| 2012 | Hospitalisation | Hit object |
| 2012 | Medical treatment | Hit object |
| 2012 | Medical treatment | Hit object |
| 2012 | Minor injury | Hit object |
| 2012 | Hospitalisation | Head-on |
| 2012 | Minor injury | Hit object |
Add AND with WHERE clause
SELECT Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature = 'Head-on'
AND Crash_Year =2001;
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2001 | Hospitalisation | Head-on |
| 2001 | Hospitalisation | Head-on |
| 2001 | Hospitalisation | Head-on |
| 2001 | Medical treatment | Head-on |
| 2001 | Fatal | Head-on |
| 2001 | Property damage only | Head-on |
| 2001 | Fatal | Head-on |
| 2001 | Minor injury | Head-on |
| 2001 | Hospitalisation | Head-on |
| 2001 | Medical treatment | Head-on |
Use WHERE Clause with IN function
SELECT Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature IN ('Head-on', 'Sideswipe');
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2001 | Hospitalisation | Head-on |
| 2012 | Medical treatment | Sideswipe |
| 2012 | Medical treatment | Sideswipe |
| 2012 | Medical treatment | Sideswipe |
| 2012 | Hospitalisation | Head-on |
| 2012 | Medical treatment | Sideswipe |
| 2012 | Hospitalisation | Sideswipe |
| 2013 | Hospitalisation | Head-on |
| 2002 | Medical treatment | Sideswipe |
| 2013 | Medical treatment | Sideswipe |
Use AS Clause to rename a column from a table.
SELECT TOP(5) Crash_Nature AS 'Crash Type'
FROM dbo.crash_location;
| Crash Type |
|---|
| Head-on |
| Angle |
| Angle |
| Struck by external load |
| Hit object |
Use AS Clause to rename multiple columns from a table.
SELECT TOP(5) Crash_Severity AS 'Severity of Crash', Crash_Nature AS 'Crash Type'
FROM dbo.crash_location;
| Severity of Crash | Crash Type |
|---|---|
| Hospitalisation | Head-on |
| Property damage only | Angle |
| Property damage only | Angle |
| Property damage only | Struck by external load |
| Property damage only | Hit object |
Let’s use ORDER BY clause that we can add to SELECT statement to sort records into an order such as DESC for descending order and ASC for ascending order.
SELECT TOP (10) Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
ORDER BY Crash_Year ASC;
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2001 | Property damage only | Angle |
| 2001 | Property damage only | Angle |
| 2001 | Hospitalisation | Head-on |
| 2001 | Property damage only | Hit parked vehicle |
| 2001 | Medical treatment | Angle |
| 2001 | Medical treatment | Rear-end |
| 2001 | Fatal | Hit object |
| 2001 | Property damage only | Rear-end |
| 2001 | Minor injury | Hit parked vehicle |
| 2001 | Minor injury | Angle |
SELECT TOP (10) Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
ORDER BY Crash_Year DESC;
| Crash_Year | Crash_Severity | Crash_Nature |
|---|---|---|
| 2019 | Minor injury | Angle |
| 2019 | Hospitalisation | Rear-end |
| 2019 | Hospitalisation | Hit parked vehicle |
| 2019 | Hospitalisation | Angle |
| 2019 | Hospitalisation | Angle |
| 2019 | Hospitalisation | Rear-end |
| 2019 | Medical treatment | Rear-end |
| 2019 | Hospitalisation | Hit pedestrian |
| 2019 | Minor injury | Rear-end |
| 2019 | Hospitalisation | Hit object |
To find a text that matches with a pattern, we need to use the LIKE operator in WHERE clause. Here is an example to filter the Crash_Severity name started by H. The LIKE operator and % symbol (that is called a wildcard) create a text pattern and the query returns the records from the column that match the pattern.
SELECT Crash_Severity
FROM dbo.crash_location
WHERE Crash_Severity LIKE 'H%';
| Crash_Severity |
|---|
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
| Hospitalisation |
Let’s use GROUP BY statement to group rows that have the same values.
SELECT Crash_Year,Crash_Nature, COUNT(*) AS CountOfCrashNature
FROM dbo.crash_location
GROUP BY Crash_Year, Crash_Nature;
| Crash_Year | Crash_Nature | CountOfCrashNature |
|---|---|---|
| 2006 | Hit object | 5437 |
| 2008 | Non-collision - miscellaneous | 60 |
| 2019 | Sideswipe | 710 |
| 2012 | Overturned | 656 |
| 2001 | Hit pedestrian | 880 |
| 2019 | Struck by internal load | 2 |
| 2005 | Fall from vehicle | 523 |
| 2001 | Other | 3 |
| 2012 | Collision - miscellaneous | 17 |
| 2014 | Rear-end | 3303 |
I’ll continue this learning in another posts from here to advance level.