• 1 Advanced filters
    • 1.1 filtering text data in SQL
    • 1.2 Basic arithmetic
  • 2 SQL subqueries
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)
## read in the dataset
df <- readr::read_csv("recipe_site_traffic_2212.csv")

## sample 100 observations and select first 3 variables
set.seed(1123)
data1<- df |>
  sample_n(size=25) |> 
  select(1,2,3,6)

## sample 100 observations and select subsequent 3 variables(including the first(ID))
## 
set.seed(1123)
data2<- df |>
  sample_n(size=25) |> 
  select(1,2,4,5,6)

0.1 Create a database

con<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con,data1)
copy_to(con,data2)

0.1.1 filtering using WHERE in SQL

  • WHERE is a filtering clause

  • In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:

  • = equal

  • <> not equal

  • < less than

  • > greater than

  • <= less than or equal to

  • >= greater than or equal to

  • You can build up your WHERE queries by combining multiple conditions with the AND keyword.

SELECT * 
FROM data2 
WHERE category='Potato'
1 records
recipe calories sugar protein category
624 236.62 0.81 9.07 Potato

0.2 filtering using IN and WHERE in SQL

SELECT * 
FROM data2 
WHERE category IN ('Chicken','Chicken Breast')
5 records
recipe calories sugar protein category
609 NA NA NA Chicken Breast
671 481.88 1.34 51.90 Chicken
683 147.24 0.94 54.00 Chicken
419 1830.28 1.83 44.74 Chicken Breast
117 NA NA NA Chicken Breast

0.2.1 COUNT and AS alias in SQL

  • The COUNT statement lets you count then returning the number of rows in one or more columns.
  • we use AS to rename default name
  • COUNT(*) tells you how many rows are in a table
SELECT COUNT(*) AS n 
FROM data2
1 records
n
25

0.2.2 count filters in SQL

SELECT COUNT(category) AS n_potato
FROM data2 
WHERE category IN ('Chicken Breast','Chicken')
1 records
n_potato
5

0.2.3 Updating a database using dbExecute

  • categories before
ABCDEFGHIJ0123456789
category
<chr>
n
<int>
percent
<dbl>
Beverages10.04
Breakfast40.16
Chicken20.08
Chicken Breast30.12
Dessert30.12
Meat40.16
One Dish Meal30.12
Pork10.04
Potato10.04
Vegetable30.12

0.2.4 Updating a database

  • we need to change a category called Breakfast to Breakfast meal
  • we shall use REPLACE
UPDATE data2 
SET category= REPLACE(category,'Breakfast','Breakfast meal')
WHERE category='Breakfast'

0.2.5 Querying to check results

SELECT category, COUNT(category) as n_per_category
FROM data2
GROUP BY category
Displaying records 1 - 10
category n_per_category
Beverages 1
Breakfast meal 4
Chicken 2
Chicken Breast 3
Dessert 3
Meat 4
One Dish Meal 3
Pork 1
Potato 1
Vegetable 3
  • we see that Breakfast has updated to Breakfast meal

0.3 Missing data in SQL

  • in SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL
  • Use IS NULL AND IS NOT NULL
SELECT * 
FROM data2 
WHERE Protein IS NULL
3 records
recipe calories sugar protein category
912 NA NA NA Dessert
609 NA NA NA Chicken Breast
117 NA NA NA Chicken Breast
SELECT * 
FROM data2 
WHERE Protein IS NOT NULL
Displaying records 1 - 10
recipe calories sugar protein category
624 236.62 0.81 9.07 Potato
102 198.98 0.39 39.12 Vegetable
427 187.41 86.97 4.49 Dessert
324 49.50 4.69 53.33 Breakfast meal
380 75.89 8.18 17.64 Meat
671 481.88 1.34 51.90 Chicken
683 147.24 0.94 54.00 Chicken
208 20.98 9.80 0.82 Beverages
633 105.95 0.07 7.55 Vegetable
859 388.44 4.62 13.93 Pork

0.4 How many categories do we have

0.4.1 select distinct items in SQL

  • Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword.
SELECT COUNT(DISTINCT category) AS unique_categories 
FROM data2
1 records
unique_categories
10

1 Advanced filters

1.0.1 filtering in SQL

SELECT * 
FROM data2 WHERE sugar > 1 AND sugar < 5 
AND category='Breakfast meal'
3 records
recipe calories sugar protein category
324 49.5 4.69 53.33 Breakfast meal
598 212.4 2.25 23.78 Breakfast meal
526 93.5 2.57 23.68 Breakfast meal
  • also achieved by using BETWEEN and AND
SELECT * 
FROM data2 
WHERE sugar BETWEEN 1 AND 5 
AND category='Breakfast meal'
3 records
recipe calories sugar protein category
324 49.5 4.69 53.33 Breakfast meal
598 212.4 2.25 23.78 Breakfast meal
526 93.5 2.57 23.68 Breakfast meal

1.1 filtering text data in SQL

  • use LIKE
  • the LIKE operator can be used in a WHERE clause to search for a pattern in a column.
  • To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text

The _ wildcard will match a single character

SELECT * 
FROM data2 
WHERE category LIKE 'Chicken%'
5 records
recipe calories sugar protein category
609 NA NA NA Chicken Breast
671 481.88 1.34 51.90 Chicken
683 147.24 0.94 54.00 Chicken
419 1830.28 1.83 44.74 Chicken Breast
117 NA NA NA Chicken Breast

1.1.1 Aggregate functions in SQL

SELECT AVG(sugar) AS avg_sugar,
       MAX(sugar) AS max_sugar,
       MIN(sugar) AS min_sugar,
       VARIANCE(sugar) AS variance,
       STDEV(sugar) AS stnd_deviation
FROM data2;
1 records
avg_sugar max_sugar min_sugar variance stnd_deviation
11.24864 86.97 0.07 426.5449 20.65296

1.1.2 Grouping and aggregating in SQL

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category;
Displaying records 1 - 10
category avg_sugar max_sugar min_sugar
Beverages 9.8000000 9.80 9.80
Breakfast meal 5.1600000 11.13 2.25
Chicken 1.1400000 1.34 0.94
Chicken Breast 1.8300000 1.83 1.83
Dessert 69.5700000 86.97 52.17
Meat 13.5525000 30.06 5.21
One Dish Meal 3.9066667 6.09 1.46
Pork 4.6200000 4.62 4.62
Potato 0.8100000 0.81 0.81
Vegetable 0.8066667 1.96 0.07

we can order the data by a certain column using ORDER BY clause

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar;
Displaying records 1 - 10
category avg_sugar max_sugar min_sugar
Vegetable 0.8066667 1.96 0.07
Potato 0.8100000 0.81 0.81
Chicken 1.1400000 1.34 0.94
Chicken Breast 1.8300000 1.83 1.83
One Dish Meal 3.9066667 6.09 1.46
Pork 4.6200000 4.62 4.62
Breakfast meal 5.1600000 11.13 2.25
Beverages 9.8000000 9.80 9.80
Meat 13.5525000 30.06 5.21
Dessert 69.5700000 86.97 52.17

the above output shows results arranged in ascending order of avg_sugar , if we needed to begin with the largest descending we could useORDER BY...DESC

SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar DESC;
Displaying records 1 - 10
category avg_sugar max_sugar min_sugar
Dessert 69.5700000 86.97 52.17
Meat 13.5525000 30.06 5.21
Beverages 9.8000000 9.80 9.80
Breakfast meal 5.1600000 11.13 2.25
Pork 4.6200000 4.62 4.62
One Dish Meal 3.9066667 6.09 1.46
Chicken Breast 1.8300000 1.83 1.83
Chicken 1.1400000 1.34 0.94
Potato 0.8100000 0.81 0.81
Vegetable 0.8066667 1.96 0.07

1.1.3 Using `UPPER()

Sometimes you may want to change a column to all upper cases

SELECT recipe,category , UPPER(category) AS upper_cat
FROM data2;
Displaying records 1 - 10
recipe category upper_cat
624 Potato POTATO
102 Vegetable VEGETABLE
427 Dessert DESSERT
324 Breakfast meal BREAKFAST MEAL
912 Dessert DESSERT
380 Meat MEAT
609 Chicken Breast CHICKEN BREAST
671 Chicken CHICKEN
683 Chicken CHICKEN
208 Beverages BEVERAGES

1.2 Basic arithmetic

  • In addition to using aggregate functions, you can perform basic arithmetic with symbols like +, -, *, and /.
  • lets create weird variables here
SELECT category,(sugar-protein) AS diff_sugar_protein 
FROM data2;
Displaying records 1 - 10
category diff_sugar_protein
Potato -8.26
Vegetable -38.73
Dessert 82.48
Breakfast meal -48.64
Dessert NA
Meat -9.46
Chicken Breast NA
Chicken -50.56
Chicken -53.06
Beverages 8.98

1.2.1 Take the CASE

  • using CASE WHEN to categorise or group data as follows
  • SELECT….
  • CASE WHEN condition THEN category
  • WHEN condition THEN category
  • …………………………………..
  • ELSE some_category
  • END AS variable name
SELECT recipe , sugar,
   CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
        WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
        WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
        WHEN sugar IS NULL THEN 'missing'
        ELSE '>200'
        END AS sugar_category
FROM data2;
Displaying records 1 - 10
recipe sugar sugar_category
624 0.81 0-50
102 0.39 0-50
427 86.97 51-100
324 4.69 0-50
912 NA missing
380 8.18 0-50
609 NA missing
671 1.34 0-50
683 0.94 0-50
208 9.80 0-50

we can letter perform aggregations on the new column for instance get the count per each category

SELECT COUNT(*) AS count_per_category,
   CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
        WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
        WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
        WHEN sugar IS NULL THEN 'missing'
        ELSE '>200'
        END AS sugar_category
FROM data2
GROUP BY sugar_category;
3 records
count_per_category sugar_category
20 0-50
2 51-100
3 missing

2 SQL subqueries

  • use data1 and data2 databases for this task
  • the following is just an example and calories in data2 and data1 are the same so the result is the same
SELECT recipe,calories
FROM data2 
WHERE calories > 
      (SELECT AVG(calories) FROM data1);
9 records
recipe calories
671 481.88
859 388.44
171 431.28
282 409.99
91 388.37
868 307.36
885 504.09
419 1830.28
639 321.95

2.0.1 Summarize group statistics using subqueries

Sometimes you want to understand how a value varies across groups. For example, how does the maximum value per group vary across groups?

To find out, first summarize by group, and then compute summary statistics of the group results. One way to do this is to compute group values in a subquery, and then summarize the results of the subquery.

what is the standard deviation across meal cateries in the maximum amount of sugar? What about the mean, min, and max of the maximums as well?

Start by writing a subquery to compute the max() per categoy; alias the subquery result as sugar_max. Then compute the standard deviation of sugar_max with stddev(). Compute the min(), max(), and avg() of sugar_max too.

SELECT AVG(sugar_max) AS avg_sugar,
       MAX(sugar_max) AS max_sugar,
       MIN(sugar_max) AS min_sugar,
       VARIANCE(sugar_max) AS variance,
       STDEV(sugar_max) AS stnd_deviation
FROM (SELECT MAX(sugar) AS sugar_max
      FROM data2
         -- Compute max by...
      GROUP BY category) AS max_results
1 records
avg_sugar max_sugar min_sugar variance stnd_deviation
15.461 86.97 0.81 707.2168 26.59355