Who for this Learning Suitable

This learning post is going to simplify the use-case of basic SQL queries in RStudio using R Markdown from my drafted book on Machine Learning Tools. Do you want to learn more on machine learning tools to perform data analytics? If yes, then this post might be useful for you. I’ll publish an online version of the book once its ready to go by the end of this year, 2021.

Required Machine Learning Tools

I assume that you have installed following machine learning tools on your computer to reproduce this entire work.

  1. Microsoft SQL Server 2015 or above,
  2. Microsoft SQL Server Management Studio,
  3. R and RStudio.

Retriving Data from a Database Server

The process of retrieving data from a database server is known as querying. Queries allow a database user to understand the story that the data is telling. This post is going to focus on retrieving information from a sample database. This sample database was developed by Data Simulation Research Lab using Microsoft SQL Server 2019.

Sample Database

In order to explore and learn basic SQL queries in Microsoft SQL Server, first we need to have a database. Data Simulation Research lab (DSRLab) developed their own datawarehouse using Microsoft SQL Server 2019 which is publicly open in a GitHub repository. I am going to use this sample database here to explore basic SQL queries. You can find the download link to the database from the GitHub repository at URL https://github.com/NikhilChandraSarkar/DSRLab-Data-Warehouse. On this page, you find the section for the DSRLab full database backups and the file called DSRLabDW_Full.bak. Let’s go ahead and click on that link and download the file to your computer. Now you need to move the downloaded file into the Microsoft SQL Server backup folder on your computer and rename it as DSRLabDW. Then you can login to your Microsoft SQL Server using a Microsoft SQL Server Management Studio tool to restore the sample database on the server.

Connect Mircosoft SQL Server with RStudio

Now I need to connect my Microsoft SQL Server instance with my RStudio. The following machine learning code in RStudio using R Marckdown shows how to connect to a local Microsoft SQL Server instance:

library(odbc)
## Warning: package 'odbc' was built under R version 4.0.5
con <- dbConnect(odbc(), 
                 Driver = "SQL Server", 
                 Server = "localhost\\SQLEXPRESS",
                 Database = "DSRLabDW",
                 Trusted_Connection = "True")

Write SQL queries in R Markdown

To enable SQL quries in R Markdown for this learning, I need to use {sql, connection =con} in each chunk instead of {r}.Though SQL language is case and white space insensitive, here I use the standard convention to write all query command keywords in uppercase letters and finish the statement with a semicolon.

Let’s get started writing a SELECT statement to retrieve data from the ‘DSRLabDW’ database. Here I use a SELECT statement to select a column named ‘Crash_Severity’ from a table named ‘crash_location’ from the ‘DSRLabDW’ database. And they display values from every record in that column.

SELECT  Crash_Severity
FROM dbo.crash_location;
Displaying records 1 - 10
Crash_Severity
Hospitalisation
Property damage only
Property damage only
Property damage only
Property damage only
Medical treatment
Medical treatment
Medical treatment
Medical treatment
Medical treatment

Let’s continue working with ‘crash_location’ table to select multiple columns using following SQL query:

SELECT  Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location;
Displaying records 1 - 10
Crash_Year Crash_Severity Crash_Nature
2001 Hospitalisation Head-on
2001 Property damage only Angle
2001 Property damage only Angle
2005 Property damage only Struck by external load
2002 Property damage only Hit object
2012 Medical treatment Sideswipe
2012 Medical treatment Angle
2012 Medical treatment Hit object
2012 Medical treatment Angle
2012 Medical treatment Sideswipe

Let’s continue working with ‘crash_location’ table to select all columns using * symbol after SELECT keyword.

SELECT *
FROM dbo.crash_location;
Displaying records 1 - 10
Crash_Ref_Number Crash_Severity Crash_Year Crash_Month Crash_Day_Of_Week Crash_Hour Crash_Nature Crash_Type Crash_Longitude_GDA94 Crash_Latitude_GDA94 Crash_Street Crash_Street_Intersecting Loc_Post_Code Crash_DCA_Code Crash_DCA_Description Count_Casualty_Fatality Count_Casualty_Hospitalised Count_Casualty_MedicallyTreated Count_Casualty_MinorInjury Loc_Suburb Loc_Local_Government_Area Loc_Police_Division Loc_Police_District Loc_Police_Region Loc_Queensland_Transport_Region Loc_Main_Roads_Region Loc_ABS_Statistical_Area_2 Loc_ABS_Statistical_Area_3 Loc_ABS_Statistical_Area_4 Loc_ABS_Remoteness Loc_State_Electorate Loc_Federal_Electorate Crash_Controlling_Authority Crash_Roadway_Feature Crash_Traffic_Control Crash_Speed_Limit Crash_Road_Surface_Condition Crash_Atmospheric_Condition Crash_Lighting_Condition Crash_Road_Horiz_Align Crash_Road_Vert_Align Crash_DCA_Group_Description DCA_Key_Approach_Dir Count_Casualty_Total Count_Unit_Car Count_Unit_Motorcycle_Moped Count_Unit_Truck Count_Unit_Bus Count_Unit_Bicycle Count_Unit_Pedestrian Count_Unit_Other
1 Hospitalisation 2001 January Monday 6 Head-on Multi-Vehicle 153.0522011 -27.58637006 Gretel St NA 4109 201 Veh’S Opposite Approach: Head On 0 1 0 0 Sunnybank Hills Brisbane City Acacia Ridge South Brisbane Brisbane SEQ South Metropolitan Sunnybank Hills Sunnybank Brisbane - South Major Cities Stretton Moreton Locally-controlled No Roadway Feature No traffic control 0 - 50 km/h Sealed - Dry Clear Daylight Curved - view obscured Grade Head-on S 1 1 0 0 0 1 0 0
10 Property damage only 2001 January Tuesday 15 Angle Multi-Vehicle 153.0207807 -27.57233755 Desgrand St Kerry Rd 4108 202 Veh’S Opposite Approach: Thru-Right 0 0 0 0 Archerfield Brisbane City Acacia Ridge South Brisbane Brisbane SEQ South Metropolitan Rocklea - Acacia Ridge Rocklea - Acacia Ridge Brisbane - South Major Cities Algester Moreton Locally-controlled Intersection - T-Junction No traffic control 60 km/h Sealed - Dry Clear Daylight Straight Level Opposing vehicles turning W 0 2 0 0 0 0 0 0
100 Property damage only 2001 June Tuesday 17 Angle Multi-Vehicle 153.0830031 -27.60004316 Beenleigh Rd Warrigal Rd 4113 202 Veh’S Opposite Approach: Thru-Right 0 0 0 0 Runcorn Brisbane City Upper Mount Gravatt South Brisbane Brisbane SEQ South Metropolitan Runcorn Sunnybank Brisbane - South Major Cities Stretton Moreton Locally-controlled Intersection - T-Junction Operating traffic lights 60 km/h Sealed - Dry Clear Daylight Straight Level Opposing vehicles turning W 0 2 0 0 0 0 0 0
1000 Property damage only 2005 August Monday 10 Struck by external load Other 153.0700223 -27.61138557 Compton Rd NA 4113 610 Pass & Misc: Load Hit Vehicle 0 0 0 0 Runcorn Brisbane City Acacia Ridge South Brisbane Brisbane SEQ South Metropolitan Runcorn Sunnybank Brisbane - South Major Cities Stretton Moreton Locally-controlled No Roadway Feature No traffic control 70 km/h Sealed - Dry Clear Daylight Straight Level Other W 0 1 0 1 0 0 0 0
10000 Property damage only 2002 December Tuesday 5 Hit object Single Vehicle 153.0009104 -27.45869677 Latrobe Tce NA 4064 803 Off Path-Curve: Off Cway Rt Bend Hit Obj 0 0 0 0 Paddington Brisbane City Brisbane City North Brisbane Brisbane SEQ North Metropolitan Paddington - Milton Brisbane Inner - West Brisbane Inner City Major Cities Cooper Brisbane Locally-controlled No Roadway Feature No traffic control 0 - 50 km/h Sealed - Wet Raining Daylight Curved - view open Grade Off carriageway on curve hit object E 0 2 0 0 0 0 0 0
100000 Medical treatment 2012 August Tuesday 22 Sideswipe Multi-Vehicle 153.0712847 -27.693116 Green Rd NA 4125 201 Veh’S Opposite Approach: Head On 0 0 2 1 Park Ridge Logan City Crestmead Logan South Eastern SEQ South South Coast Crestmead Browns Plains Logan - Beaudesert Major Cities Logan Rankin Locally-controlled No Roadway Feature No traffic control 70 km/h Sealed - Dry Clear Darkness - Not lighted Straight Grade Head-on W 3 2 0 0 0 0 0 0
100001 Medical treatment 2012 August Wednesday 7 Angle Multi-Vehicle 153.081838 -27.69754 Bumstead Rd Julie St 4132 105 Veh’S Adjacent Approach: Right-Right 0 0 1 0 Crestmead Logan City Crestmead Logan South Eastern SEQ South South Coast Crestmead Browns Plains Logan - Beaudesert Major Cities Logan Forde Locally-controlled Intersection - T-Junction Give way sign 80 - 90 km/h Sealed - Dry Clear Daylight Straight Grade Intersection from adjacent approaches S 1 2 0 0 0 0 0 0
100002 Medical treatment 2012 August Saturday 11 Hit object Single Vehicle 153.0428417 -27.65746035 Fern St NA 4118 804 Off Path-Curve: Off Cway Lt Bend Hit Obj 0 0 1 0 Browns Plains Logan City Browns Plains Logan South Eastern SEQ South South Coast Browns Plains Browns Plains Logan - Beaudesert Major Cities Algester Rankin Locally-controlled Intersection - T-Junction No traffic control 0 - 50 km/h Sealed - Dry Clear Daylight Straight Grade Off carriageway on curve hit object E 1 1 0 0 0 0 0 0
100003 Medical treatment 2012 August Friday 18 Angle Multi-Vehicle 153.093004 -27.681338 Browns Plains Rd Second Ave 4132 202 Veh’S Opposite Approach: Thru-Right 0 0 2 1 Crestmead Logan City Crestmead Logan South Eastern SEQ South South Coast Marsden Browns Plains Logan - Beaudesert Major Cities Woodridge Rankin Locally-controlled Intersection - Cross Operating traffic lights 70 km/h Sealed - Dry Clear Dawn/Dusk Straight Level Opposing vehicles turning E 3 2 0 0 0 0 0 0
100004 Medical treatment 2012 August Friday 8 Sideswipe Multi-Vehicle 153.0528835 -27.66274564 Waller Rd NA 4118 307 Veh’S Same Direction: Lane Change Left 0 0 1 0 Browns Plains Logan City Browns Plains Logan South Eastern SEQ South South Coast Browns Plains Browns Plains Logan - Beaudesert Major Cities Woodridge Rankin Locally-controlled No Roadway Feature No traffic control 60 km/h Sealed - Dry Clear Daylight Straight Level Lane changes N 1 2 0 0 0 0 0 0

Use TOP function to limit the records from a column

SELECT TOP(8) Crash_Severity
FROM dbo.crash_location;
8 records
Crash_Severity
Hospitalisation
Property damage only
Property damage only
Property damage only
Property damage only
Medical treatment
Medical treatment
Medical treatment

Use TOP function to limit the records from multiple columns

SELECT TOP (5) Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location;
5 records
Crash_Year Crash_Severity Crash_Nature
2001 Hospitalisation Head-on
2001 Property damage only Angle
2001 Property damage only Angle
2005 Property damage only Struck by external load
2002 Property damage only Hit object

Let’s use a WHERE clause to filter the data from the same table. Here I specify the condition after WHERE keyword that I want to filter by. The filtering condition in the WHERE clause is called a predicate. It usually involve a comparison operator such as this equal sign.

SELECT Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature = 'Head-on';
Displaying records 1 - 10
Crash_Year Crash_Severity Crash_Nature
2001 Hospitalisation Head-on
2012 Hospitalisation Head-on
2013 Hospitalisation Head-on
2013 Hospitalisation Head-on
2014 Hospitalisation Head-on
2014 Minor injury Head-on
2014 Hospitalisation Head-on
2014 Hospitalisation Head-on
2014 Hospitalisation Head-on
2015 Medical treatment Head-on

Add OR with WHERE clause

SELECT  Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature = 'Head-on'
OR Crash_Nature = 'Hit object';
Displaying records 1 - 10
Crash_Year Crash_Severity Crash_Nature
2001 Hospitalisation Head-on
2002 Property damage only Hit object
2012 Medical treatment Hit object
2012 Hospitalisation Hit object
2012 Hospitalisation Hit object
2012 Medical treatment Hit object
2012 Medical treatment Hit object
2012 Minor injury Hit object
2012 Hospitalisation Head-on
2012 Minor injury Hit object

Add AND with WHERE clause

SELECT  Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature = 'Head-on'
AND Crash_Year =2001;
Displaying records 1 - 10
Crash_Year Crash_Severity Crash_Nature
2001 Hospitalisation Head-on
2001 Hospitalisation Head-on
2001 Hospitalisation Head-on
2001 Medical treatment Head-on
2001 Fatal Head-on
2001 Property damage only Head-on
2001 Fatal Head-on
2001 Minor injury Head-on
2001 Hospitalisation Head-on
2001 Medical treatment Head-on

Use WHERE Clause with IN function

SELECT  Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
WHERE Crash_Nature IN ('Head-on', 'Sideswipe');
Displaying records 1 - 10
Crash_Year Crash_Severity Crash_Nature
2001 Hospitalisation Head-on
2012 Medical treatment Sideswipe
2012 Medical treatment Sideswipe
2012 Medical treatment Sideswipe
2012 Hospitalisation Head-on
2012 Medical treatment Sideswipe
2012 Hospitalisation Sideswipe
2013 Hospitalisation Head-on
2002 Medical treatment Sideswipe
2013 Medical treatment Sideswipe

Use AS Clause to rename a column from a table.

SELECT TOP(5) Crash_Nature AS 'Crash Type'
FROM dbo.crash_location;
5 records
Crash Type
Head-on
Angle
Angle
Struck by external load
Hit object

Use AS Clause to rename multiple columns from a table.

SELECT  TOP(5) Crash_Severity AS 'Severity of Crash', Crash_Nature AS 'Crash Type'
FROM dbo.crash_location;
5 records
Severity of Crash Crash Type
Hospitalisation Head-on
Property damage only Angle
Property damage only Angle
Property damage only Struck by external load
Property damage only Hit object

Let’s use ORDER BY clause that we can add to SELECT statement to sort records into an order such as DESC for descending order and ASC for ascending order.

SELECT  TOP (10) Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
ORDER BY Crash_Year ASC;
Displaying records 1 - 10
Crash_Year Crash_Severity Crash_Nature
2001 Property damage only Angle
2001 Property damage only Angle
2001 Hospitalisation Head-on
2001 Property damage only Hit parked vehicle
2001 Medical treatment Angle
2001 Medical treatment Rear-end
2001 Fatal Hit object
2001 Property damage only Rear-end
2001 Minor injury Hit parked vehicle
2001 Minor injury Angle
SELECT  TOP (10) Crash_Year, Crash_Severity, Crash_Nature
FROM dbo.crash_location
ORDER BY Crash_Year DESC;
Displaying records 1 - 10
Crash_Year Crash_Severity Crash_Nature
2019 Minor injury Angle
2019 Hospitalisation Rear-end
2019 Hospitalisation Hit parked vehicle
2019 Hospitalisation Angle
2019 Hospitalisation Angle
2019 Hospitalisation Rear-end
2019 Medical treatment Rear-end
2019 Hospitalisation Hit pedestrian
2019 Minor injury Rear-end
2019 Hospitalisation Hit object

To find a text that matches with a pattern, we need to use the LIKE operator in WHERE clause. Here is an example to filter the Crash_Severity name started by H. The LIKE operator and % symbol (that is called a wildcard) create a text pattern and the query returns the records from the column that match the pattern.

SELECT Crash_Severity
FROM dbo.crash_location
WHERE Crash_Severity LIKE 'H%';
Displaying records 1 - 10
Crash_Severity
Hospitalisation
Hospitalisation
Hospitalisation
Hospitalisation
Hospitalisation
Hospitalisation
Hospitalisation
Hospitalisation
Hospitalisation
Hospitalisation

Let’s use GROUP BY statement to group rows that have the same values.

SELECT Crash_Year,Crash_Nature, COUNT(*) AS CountOfCrashNature
FROM dbo.crash_location
GROUP BY Crash_Year, Crash_Nature;
Displaying records 1 - 10
Crash_Year Crash_Nature CountOfCrashNature
2006 Hit object 5437
2008 Non-collision - miscellaneous 60
2019 Sideswipe 710
2012 Overturned 656
2001 Hit pedestrian 880
2019 Struck by internal load 2
2005 Fall from vehicle 523
2001 Other 3
2012 Collision - miscellaneous 17
2014 Rear-end 3303

I’ll continue this learning in another posts from here to advance level.