1 Advanced filters

1.1 filtering text data in SQL
1.2 Basic arithmetic

2 SQL subqueries

library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)
## read in the dataset
df <- readr::read_csv("recipe_site_traffic_2212.csv")

## sample 100 observations and select first 3 variables
set.seed(1123)
data1<- df |>
  sample_n(size=25) |> 
  select(1,2,3,6)

## sample 100 observations and select subsequent 3 variables(including the first(ID))
## 
set.seed(1123)
data2<- df |>
  sample_n(size=25) |> 
  select(1,2,4,5,6)

0.1 Create a database

con<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con,data1)
copy_to(con,data2)

0.1.1 filtering using `WHERE` in SQL

WHERE is a filtering clause
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
= equal
<> not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
You can build up your WHERE queries by combining multiple conditions with the AND keyword.

SELECT * 
FROM data2 
WHERE category='Potato'

1 records
recipe	calories	sugar	protein	category
624	236.62	0.81	9.07	Potato

0.2 filtering using `IN` and `WHERE` in SQL

SELECT * 
FROM data2 
WHERE category IN ('Chicken','Chicken Breast')

5 records
recipe	calories	sugar	protein	category
609	NA	NA	NA	Chicken Breast
671	481.88	1.34	51.90	Chicken
683	147.24	0.94	54.00	Chicken
419	1830.28	1.83	44.74	Chicken Breast
117	NA	NA	NA	Chicken Breast

0.2.1 `COUNT` and `AS` alias in SQL

The COUNT statement lets you count then returning the number of rows in one or more columns.
we use AS to rename default name
COUNT(*) tells you how many rows are in a table

SELECT COUNT(*) AS n 
FROM data2

1 records
n
25

0.2.2 count filters in SQL

SELECT COUNT(category) AS n_potato
FROM data2 
WHERE category IN ('Chicken Breast','Chicken')

1 records
n_potato
5

0.2.3 Updating a database using `dbExecute`

categories before

ABCDEFGHIJ0123456789

category <chr>	n <int>	percent <dbl>
Beverages	1	0.04
Breakfast	4	0.16
Chicken	2	0.08
Chicken Breast	3	0.12
Dessert	3	0.12
Meat	4	0.16
One Dish Meal	3	0.12
Pork	1	0.04
Potato	1	0.04
Vegetable	3	0.12

0.2.4 Updating a database

we need to change a category called Breakfast to Breakfast meal
we shall use REPLACE

UPDATE data2 
SET category= REPLACE(category,'Breakfast','Breakfast meal')
WHERE category='Breakfast'

0.2.5 Querying to check results

SELECT category, COUNT(category) as n_per_category
FROM data2
GROUP BY category

Displaying records 1 - 10
category	n_per_category
Beverages	1
Breakfast meal	4
Chicken	2
Chicken Breast	3
Dessert	3
Meat	4
One Dish Meal	3
Pork	1
Potato	1
Vegetable	3

we see that Breakfast has updated to Breakfast meal

0.3 Missing data in SQL

in SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL
Use IS NULL AND IS NOT NULL

SELECT * 
FROM data2 
WHERE Protein IS NULL

3 records
recipe	calories	sugar	protein	category
912	NA	NA	NA	Dessert
609	NA	NA	NA	Chicken Breast
117	NA	NA	NA	Chicken Breast

SELECT * 
FROM data2 
WHERE Protein IS NOT NULL

Displaying records 1 - 10
recipe	calories	sugar	protein	category
624	236.62	0.81	9.07	Potato
102	198.98	0.39	39.12	Vegetable
427	187.41	86.97	4.49	Dessert
324	49.50	4.69	53.33	Breakfast meal
380	75.89	8.18	17.64	Meat
671	481.88	1.34	51.90	Chicken
683	147.24	0.94	54.00	Chicken
208	20.98	9.80	0.82	Beverages
633	105.95	0.07	7.55	Vegetable
859	388.44	4.62	13.93	Pork

0.4 How many categories do we have

0.4.1 select `distinct` items in SQL

Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword.

SELECT COUNT(DISTINCT category) AS unique_categories 
FROM data2

1 records
unique_categories
10

1 Advanced filters

1.0.1 filtering in SQL

SELECT * 
FROM data2 WHERE sugar > 1 AND sugar < 5 
AND category='Breakfast meal'

3 records
recipe	calories	sugar	protein	category
324	49.5	4.69	53.33	Breakfast meal
598	212.4	2.25	23.78	Breakfast meal
526	93.5	2.57	23.68	Breakfast meal

also achieved by using BETWEEN and AND

SELECT * 
FROM data2 
WHERE sugar BETWEEN 1 AND 5 
AND category='Breakfast meal'

3 records
recipe	calories	sugar	protein	category
324	49.5	4.69	53.33	Breakfast meal
598	212.4	2.25	23.78	Breakfast meal
526	93.5	2.57	23.68	Breakfast meal

1.1 filtering text data in SQL

use LIKE
the LIKE operator can be used in a WHERE clause to search for a pattern in a column.
To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text

The _ wildcard will match a single character

SELECT * 
FROM data2 
WHERE category LIKE 'Chicken%'

5 records
recipe	calories	sugar	protein	category
609	NA	NA	NA	Chicken Breast
671	481.88	1.34	51.90	Chicken
683	147.24	0.94	54.00	Chicken
419	1830.28	1.83	44.74	Chicken Breast
117	NA	NA	NA	Chicken Breast

1.1.1 Aggregate functions in SQL

SELECT AVG(sugar) AS avg_sugar,
       MAX(sugar) AS max_sugar,
       MIN(sugar) AS min_sugar,
       VARIANCE(sugar) AS variance,
       STDEV(sugar) AS stnd_deviation
FROM data2;

1 records
avg_sugar	max_sugar	min_sugar	variance	stnd_deviation
11.24864	86.97	0.07	426.5449	20.65296

1.1.2 Grouping and aggregating in SQL

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category;

Displaying records 1 - 10
category	avg_sugar	max_sugar	min_sugar
Beverages	9.8000000	9.80	9.80
Breakfast meal	5.1600000	11.13	2.25
Chicken	1.1400000	1.34	0.94
Chicken Breast	1.8300000	1.83	1.83
Dessert	69.5700000	86.97	52.17
Meat	13.5525000	30.06	5.21
One Dish Meal	3.9066667	6.09	1.46
Pork	4.6200000	4.62	4.62
Potato	0.8100000	0.81	0.81
Vegetable	0.8066667	1.96	0.07

we can order the data by a certain column using ORDER BY clause

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar;

Displaying records 1 - 10
category	avg_sugar	max_sugar	min_sugar
Vegetable	0.8066667	1.96	0.07
Potato	0.8100000	0.81	0.81
Chicken	1.1400000	1.34	0.94
Chicken Breast	1.8300000	1.83	1.83
One Dish Meal	3.9066667	6.09	1.46
Pork	4.6200000	4.62	4.62
Breakfast meal	5.1600000	11.13	2.25
Beverages	9.8000000	9.80	9.80
Meat	13.5525000	30.06	5.21
Dessert	69.5700000	86.97	52.17

the above output shows results arranged in ascending order of avg_sugar , if we needed to begin with the largest descending we could useORDER BY...DESC

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar DESC;

Displaying records 1 - 10
category	avg_sugar	max_sugar	min_sugar
Dessert	69.5700000	86.97	52.17
Meat	13.5525000	30.06	5.21
Beverages	9.8000000	9.80	9.80
Breakfast meal	5.1600000	11.13	2.25
Pork	4.6200000	4.62	4.62
One Dish Meal	3.9066667	6.09	1.46
Chicken Breast	1.8300000	1.83	1.83
Chicken	1.1400000	1.34	0.94
Potato	0.8100000	0.81	0.81
Vegetable	0.8066667	1.96	0.07

1.1.3 Using `UPPER()

Sometimes you may want to change a column to all upper cases

SELECT recipe,category , UPPER(category) AS upper_cat
FROM data2;

Displaying records 1 - 10
recipe	category	upper_cat
624	Potato	POTATO
102	Vegetable	VEGETABLE
427	Dessert	DESSERT
324	Breakfast meal	BREAKFAST MEAL
912	Dessert	DESSERT
380	Meat	MEAT
609	Chicken Breast	CHICKEN BREAST
671	Chicken	CHICKEN
683	Chicken	CHICKEN
208	Beverages	BEVERAGES

1.2 Basic arithmetic

In addition to using aggregate functions, you can perform basic arithmetic with symbols like +, -, *, and /.
lets create weird variables here

SELECT category,(sugar-protein) AS diff_sugar_protein 
FROM data2;

Displaying records 1 - 10
category	diff_sugar_protein
Potato	-8.26
Vegetable	-38.73
Dessert	82.48
Breakfast meal	-48.64
Dessert	NA
Meat	-9.46
Chicken Breast	NA
Chicken	-50.56
Chicken	-53.06
Beverages	8.98

1.2.1 Take the `CASE`

using CASE WHEN to categorise or group data as follows
SELECT….
CASE WHEN condition THEN category
WHEN condition THEN category
…………………………………..
ELSE some_category
END AS variable name

SELECT recipe , sugar,
   CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
        WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
        WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
        WHEN sugar IS NULL THEN 'missing'
        ELSE '>200'
        END AS sugar_category
FROM data2;

Displaying records 1 - 10
recipe	sugar	sugar_category
624	0.81	0-50
102	0.39	0-50
427	86.97	51-100
324	4.69	0-50
912	NA	missing
380	8.18	0-50
609	NA	missing
671	1.34	0-50
683	0.94	0-50
208	9.80	0-50

we can letter perform aggregations on the new column for instance get the count per each category

SELECT COUNT(*) AS count_per_category,
   CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
        WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
        WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
        WHEN sugar IS NULL THEN 'missing'
        ELSE '>200'
        END AS sugar_category
FROM data2
GROUP BY sugar_category;

3 records
count_per_category	sugar_category
20	0-50
2	51-100
3	missing

2 SQL subqueries

use data1 and data2 databases for this task
the following is just an example and calories in data2 and data1 are the same so the result is the same

SELECT recipe,calories
FROM data2 
WHERE calories > 
      (SELECT AVG(calories) FROM data1);

9 records
recipe	calories
671	481.88
859	388.44
171	431.28
282	409.99
91	388.37
868	307.36
885	504.09
419	1830.28
639	321.95

2.0.1 Summarize group statistics using subqueries

Sometimes you want to understand how a value varies across groups. For example, how does the maximum value per group vary across groups?

To find out, first summarize by group, and then compute summary statistics of the group results. One way to do this is to compute group values in a subquery, and then summarize the results of the subquery.

what is the standard deviation across meal cateries in the maximum amount of sugar? What about the mean, min, and max of the maximums as well?

Start by writing a subquery to compute the max() per categoy; alias the subquery result as sugar_max. Then compute the standard deviation of sugar_max with stddev(). Compute the min(), max(), and avg() of sugar_max too.

SELECT AVG(sugar_max) AS avg_sugar,
       MAX(sugar_max) AS max_sugar,
       MIN(sugar_max) AS min_sugar,
       VARIANCE(sugar_max) AS variance,
       STDEV(sugar_max) AS stnd_deviation
FROM (SELECT MAX(sugar) AS sugar_max
      FROM data2
         -- Compute max by...
      GROUP BY category) AS max_results

1 records
avg_sugar	max_sugar	min_sugar	variance	stnd_deviation
15.461	86.97	0.81	707.2168	26.59355

Data Manipulation and Aggregation in SQL

Bongani Ncube

11/20/2023

0.1 Create a database

0.1.1 filtering using `WHERE` in SQL

0.2 filtering using `IN` and `WHERE` in SQL

0.2.1 `COUNT` and `AS` alias in SQL

0.2.2 count filters in SQL

0.2.3 Updating a database using `dbExecute`

0.2.4 Updating a database

0.2.5 Querying to check results

0.3 Missing data in SQL

0.4 How many categories do we have

0.4.1 select `distinct` items in SQL

1 Advanced filters

1.0.1 filtering in SQL

1.1 filtering text data in SQL

1.1.1 Aggregate functions in SQL

1.1.2 Grouping and aggregating in SQL

1.1.3 Using `UPPER()

1.2 Basic arithmetic

1.2.1 Take the `CASE`

2 SQL subqueries

2.0.1 Summarize group statistics using subqueries

Data Manipulation and Aggregation in SQL

Bongani Ncube

11/20/2023

0.1 Create a database

0.1.1 filtering using WHERE in SQL

0.2 filtering using IN and WHERE in SQL

0.2.1 COUNT and AS alias in SQL

0.2.2 count filters in SQL

0.2.3 Updating a database using dbExecute

0.2.4 Updating a database

0.2.5 Querying to check results

0.3 Missing data in SQL

0.4 How many categories do we have

0.4.1 select distinct items in SQL

1 Advanced filters

1.0.1 filtering in SQL

1.1 filtering text data in SQL

1.1.1 Aggregate functions in SQL

1.1.2 Grouping and aggregating in SQL

1.1.3 Using `UPPER()

1.2 Basic arithmetic

1.2.1 Take the CASE

2 SQL subqueries

2.0.1 Summarize group statistics using subqueries

0.1.1 filtering using `WHERE` in SQL

0.2 filtering using `IN` and `WHERE` in SQL

0.2.1 `COUNT` and `AS` alias in SQL

0.2.3 Updating a database using `dbExecute`

0.4.1 select `distinct` items in SQL

1.2.1 Take the `CASE`