0.1 Introduction

We do realise that most of the softwares around are designed to solve a problem!. To better understand what databases are ,we need to first understand the problems that necessitated the advent of databases. The answer to this is quite obvious ! you have some data or large amounts of information that you need to store, this data could be about:

  • Customers
  • Products
  • Employees etc

But why worry? , we could store the data in text files or even spreadsheets or if they are documents we can just organise them in folders…. is it! . Just having data is not a good enough reason to have a database , having data is not the problem but the problem is what comes next. Below are some of the problems that a database can solve

  • size of the data
  • ease of updating
  • accuracy
  • sensitivity
  • redundancy
  • importance

so what`s a database?

0.1.1 What is a database?

A structured storage space where the data is kept in many tables and organized so that the necessary information can be easily fetched, manipulated, and summarized.

0.1.2 tables and fields

A table is an organized set of related data stored in a tabular form, i.e., in rows and columns. A field is another term for a column of a table and record is the other name for a row

0.1.3 Relational Data

Relational data are related by unique identifiers

To better understand this, let’s consider a scenario were Bongani Has a business. say I have a number of different restaurants. In one table I might have information about these restaurants including, where they are located and what type of food they serve. I may then have a second table where information about health and safety inspections is stored. Each inspection is a different row and the date of the inspection, the inspector, and the safety rating are stored in this table. Finally, I might have a third table. This third table contains information pulled from an API, regarding the number of stars given to each restaurant, as rated by people online. Each table contains different bits of information; however, there is a common column id in each of the tables. This allows the information to be linked between the tables. The restaurant with the id “JJ29JJ” in the restaurant table would refer to the same restaurant with the id “JJ29JJ” in the health inspections table, and so on. The values in this id column are known as unique identifiers because they uniquely identify each restaurant. No two restaurants will have the same id, and the same restaurant will always have the same id, no matter what table you’re looking at. The fact that these tables have unique identifiers connecting each table to all the other tables makes this example what we call relational data.

Unique identifiers help link entries across tables

0.1.4 Why relational data?

Storing data in this way has a number of advantages; however, the three most important are:

  1. Efficient Data Storage
  2. Avoids Ambiguity
  3. Privacy

0.1.5 DBMS

DDMS stands for Database Management System, a software package used to perform various operations on the data stored in a database, such as accessing, updating, wrangling, inserting, and removing data. There are various types of DBMS, such as relational, hierarchical, network, graph, or object-oriented. These types are based on the way the data is organized, structured, and stored in the system.

0.1.6 RDBMS

RDBMS stands for Relational Database Management System. It’s the most common type of DBMS used for working with data stored in multiple tables related to each other by means of shared keys. The SQL programming language is particularly designed to interact with RDBMS. Some examples of RDBMS are MySQL, PostgreSQL, Oracle, MariaDB, etc.

0.1.7 SQL constraints

constraints Description
DEFAULT provides a default value for a column
UNIQUE allows only unique values
NOT NULL allows only non-null values.
PRIMARY KEY allows only unique and strictly non-null values (NOT NULL and UNIQUE).
FOREIGN KEY provides shared keys between two and more tables

0.1.8 unique key

A column (or multiple columns) of a table to which the UNIQUE constraint was imposed to ensure unique values in that column, including a possible NULL value (the only one).

0.1.9 foreign key

A column (or multiple columns) of a table to which the FOREIGN KEY constraint was imposed to link this column to the primary key in another table (or several tables). The purpose of foreign keys is to keep connected various tables of a database.

0.1.10 SQL queries

A query is a piece of code written in SQL to access the data from a database or to modify the data. Correspondingly, there are two types of SQL queries: select and action queries. The first ones are used to retrieve the necessary data (this also includes limiting, grouping, ordering the data, extracting the data from multiple tables, etc.), while the second ones are used to create, add, delete, update, rename the data, etc.

0.1.11 SQL comments

A human-readable clarification on what a particular piece of code does. SQL code comments can be single-line (preceded by a double dash –) or span over multiple lines (as follows: /comment_text/). When the SQL engine runs, it ignores code comments. The purpose of adding SQL code comments is to make the code more comprehensive for those people who will read it in the future.

0.1.12 types of SQL commands (or SQL subsets)

0.1.13 creating a table

any computer system that deals with storing data needs to have the four fundamental functions that have the ability to:

  • C-REATE
  • R-EAD
  • U-UPDATE
  • D-ELETE

commonly pronounced as CRUD

we use the CREATE TABLE statement.

CREATE TABLE table_name (col_1 datatype,
                           col_2 datatype,
                           col_3 datatype);

for instance if run the following query !

CREATE TABLE my_table (age INTERGER,
                       sex TEXT);

we add the values to the table using INSERT INTO .... VALUES (....)

INSERT INTO my_table VALUES (34,'male');
INSERT INTO my_table VALUES (47,'female');
INSERT INTO my_table VALUES (32,'male');   

then call the database using SELECT

SELECT * FROM my_table;

0.1.14 Querying a database

  • we will use the Employee attrition database
  • A query is a request for data from a database table (or combination of tables).
  • Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.

SELECT statement

select all with *

  • In SQL, you can select data from a table using a SELECT statement. For example, the following query selects the name column from the people table: SELECT name FROM people;
  • you may want to select all columns from a table use * rather than typing all the columns
SELECT * FROM dat_new

selecting a subset

  • you can achieve this by defining the columns to be selected e.g
SELECT col_1,col_2,col_3 
FROM my_table;

using the database we get

SELECT Attrition,Department,Education 
FROM dat_new;

select distinct items in SQL

  • Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword. for instance if you have the following database table:
  • we can see that Bongani has been recorded twice in the dataset
SELECT DISTINCT name,age,sex 
FROM database_table

filtering using WHERE in SQL

  • WHERE is a filtering clause

  • In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:

  • = equal

  • <> not equal

  • < less than

  • > greater than

  • <= less than or equal to

  • >= greater than or equal to

  • You can build up your WHERE queries by combining multiple conditions with the AND keyword.

SELECT DISTINCT name,age,sex
FROM df
WHERE sex='male';

COUNT and AS alias in SQL

  • The COUNT statement lets you count then returning the number of rows in one or more columns.
  • we use AS to rename default name
  • COUNT(*) tells you how many rows are in a table
SELECT COUNT(*) AS pple_count
FROM database_table
  • the above query calculates the number of rows in the data and finds 5 rows

we can use COUNT(DISTINCT column) to get the number of unique rows

SELECT COUNT(DISTINCT name) AS unique_count
FROM database_table

the query above will calculate rows with distinct names and finds them to be 4

We can use COUNT with WHERE

  • We can do this to get the number of terms after filtering
  • how many people are male?
SELECT COUNT(*) AS num_male
FROM database_table
WHERE sex='male';

here we see that we have 3 males although the other one has been duplicated