The purpose of this document is to arm the reader with the knowledge and skills necessary to query the Point of Sales (POS) datebase. After treading this, however, you will not be a data architect. A data analyst? Maybe. You should, after reading this, know or be able to perform the following:
The database involved is referred to as a relational database. A relational database is defined in Wikipedia as “a database that has a collection of tables of data items, all of which is formally described and organized accordingo to the relational model.” In the relational model, each table, a collection of rows or tuples and columns, fields, or attributes, represents a relation. Table 1 is an example of a table as a part of a relational database. In table 1, “Item Number”, “Description”, “Category”, and “Subcategory” are columns or attributes. Conversely, each row or tuple is a different product in the case of the inventory. If this were a dataset with patients, each row would represent a case or observation.
Furthermore, databases use keys to establish relationships between tables. A primary key should uniquely identify a tuple or row in a table. A perfect example of this is the inventory table (see below table), which uses the “Item Number” field or attribute as the primary key.
A foreign key is a field that matches a primary key of another table. The ticket history table (see below) provides an example of how the primary key from the inventory table serves as the foreign key in another table. In the provided example, the “Item Number” from the inventory table serves as the foreign key so that ticket history data can be related or connected to the inventory or product data table.
Hopefully, by now you understand or have an idea of how data is structured. Both table shown here are carefully clipped snippets of what is a much more vast and complicated database.
Before you go any further, you will need to equip yourself with a text editor, and not just any will do. The kind of text editor you should look for, even if you don't plan on coding in the future, should be able to highlight language-specific syntax – in this case SQL – and be free. I recommend Sublime Text 3, which can be downloaded here. Once Sublime Text 3 is downloaded for your respective operating system, install it using the default settings and open the program.
Once open, click the following:
* View
* Syntax
* SQL
And just like that your text editor is set up to highlight SQL-specific syntax!
Often pronounced like the word 'sequel', Structured Query Language (or SQL for short) is the data management language for relational databases. In other words, it's the bread and butter for databases.
The functionality of SQL is normally grouped into two areas: the data definition language and the data manipulation language. For the purpose of querying, this guide will not go into depth on the data definition language, which is generally used for creating tables and constraints. The data manipulation language (DML) lets you, as the name implies, manipulate data with declarative or procedural statements. Because the POS database is already set up – good to go from the box – and you are not a data architect, there is no dire need to go over declarative statements such as 'INSERT', 'UPDATE', or 'DELETE'. WIthout the full knowledge of the DML, tampering with the data could be disastrous and irreparable. Instead, we will focus on 'SELECT' statements.
A query is a way for users to not only interact and extract data from a database. It is a logical way of culling data in a way that can answer some lurking questions you have. Moreover, the data you pull with a query can take multiple formats. It can be raw and unstructured, clean and brief, and everything in between. The type of data you extract and how you extract it, the output, will depend on the type of question you have. For the more statistically-inclined, the output file (.csv) can be read into statistical analysis software such as R, SAS, SPSS, Stata for further cleaning, analyses, modeling, and visualizations.
Fortunately for you and the rest of the English-speaking population, the syntax of SQL is closely modeled after good 'ole English. If you've followed along this far you probably know English and thus learning how to query should be a piece of cake.
Below is the basic structure of a query:
SELECT table1.column1 AS name1, table1.column2 AS name2
FROM table1
WHERE condition1
AND condition2
OR condition3
GROUP BY column1
ORDER BY column1
Next, I will break up the query into individual parts and explain what each means.
Every query uses the verb “SELECT”“ to specify columns. Most basic queries begin with "SELECT” as well. What follows “SELECT” are usually the column names from a table or an asterisk ('*'). Let's say, for example, that we want to look at every column of every product from the inventory table called “inv”. In a query, it would look like this…
SELECT *
FROM inv
Simple, right? The asterisk selects all of the columns from a table. More often than not though this will result in a huge, messy dataset that will add a few steps to the data analysis process because 1) it will take a longer time to compile more results and 2) you will probably need only a few columns, not all of them. Because of this, “SELECT *” isn't usually the optimal solution. Instead, you can select multiple columns or attributes if you explicitly include them in your query. If I were to query the inventory table to find out all the possible product categories my query would look like this…
SELECT CATEG_COD
FROM inv