library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)
## read in the dataset
df <- readr::read_csv("recipe_site_traffic_2212.csv")
## sample 100 observations and select first 3 variables
set.seed(1123)
data1<- df |>
sample_n(size=25) |>
select(1,2,3,6)
## sample 100 observations and select subsequent 3 variables(including the first(ID))
##
set.seed(1123)
data2<- df |>
sample_n(size=25) |>
select(1,2,4,5,6)
Create a
database
con<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con,data1)
copy_to(con,data2)
filtering using
WHERE
in SQL
WHERE is a filtering clause
In SQL, the WHERE keyword allows you to filter based on both text
and numeric values in a table. There are a few different comparison
operators you can use:
= equal
<> not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
You can build up your WHERE queries by combining multiple
conditions with the AND keyword.
SELECT *
FROM data2
WHERE category='Potato'
1 records
624 |
236.62 |
0.81 |
9.07 |
Potato |
filtering using
IN
and WHERE
in SQL
SELECT *
FROM data2
WHERE category IN ('Chicken','Chicken Breast')
5 records
609 |
NA |
NA |
NA |
Chicken Breast |
671 |
481.88 |
1.34 |
51.90 |
Chicken |
683 |
147.24 |
0.94 |
54.00 |
Chicken |
419 |
1830.28 |
1.83 |
44.74 |
Chicken Breast |
117 |
NA |
NA |
NA |
Chicken Breast |
COUNT
and AS
alias in SQL
- The COUNT statement lets you count then returning the
number of rows in one or more columns.
- we use
AS
to rename default name
COUNT(*)
tells you how many rows are in a table
SELECT COUNT(*) AS n
FROM data2
count filters in
SQL
SELECT COUNT(category) AS n_potato
FROM data2
WHERE category IN ('Chicken Breast','Chicken')
Updating a database
using dbExecute
Updating a
database
- we need to change a category called
Breakfast
to
Breakfast meal
- we shall use
REPLACE
UPDATE data2
SET category= REPLACE(category,'Breakfast','Breakfast meal')
WHERE category='Breakfast'
Querying to check
results
SELECT category, COUNT(category) as n_per_category
FROM data2
GROUP BY category
Displaying records 1 - 10
Beverages |
1 |
Breakfast meal |
4 |
Chicken |
2 |
Chicken Breast |
3 |
Dessert |
3 |
Meat |
4 |
One Dish Meal |
3 |
Pork |
1 |
Potato |
1 |
Vegetable |
3 |
- we see that
Breakfast
has updated to
Breakfast meal
Missing data in
SQL
- in SQL,
NULL
represents a missing or unknown value. You
can check for NULL values using the expression IS NULL
- Use
IS NULL
AND IS NOT NULL
SELECT *
FROM data2
WHERE Protein IS NULL
3 records
912 |
NA |
NA |
NA |
Dessert |
609 |
NA |
NA |
NA |
Chicken Breast |
117 |
NA |
NA |
NA |
Chicken Breast |
SELECT *
FROM data2
WHERE Protein IS NOT NULL
Displaying records 1 - 10
624 |
236.62 |
0.81 |
9.07 |
Potato |
102 |
198.98 |
0.39 |
39.12 |
Vegetable |
427 |
187.41 |
86.97 |
4.49 |
Dessert |
324 |
49.50 |
4.69 |
53.33 |
Breakfast meal |
380 |
75.89 |
8.18 |
17.64 |
Meat |
671 |
481.88 |
1.34 |
51.90 |
Chicken |
683 |
147.24 |
0.94 |
54.00 |
Chicken |
208 |
20.98 |
9.80 |
0.82 |
Beverages |
633 |
105.95 |
0.07 |
7.55 |
Vegetable |
859 |
388.44 |
4.62 |
13.93 |
Pork |
How many categories
do we have
select
distinct
items in SQL
- Often your results will include many duplicate values. If you want
to select all the unique values from a column, you can use the
DISTINCT keyword.
SELECT COUNT(DISTINCT category) AS unique_categories
FROM data2
Advanced filters
filtering in
SQL
SELECT *
FROM data2 WHERE sugar > 1 AND sugar < 5
AND category='Breakfast meal'
3 records
324 |
49.5 |
4.69 |
53.33 |
Breakfast meal |
598 |
212.4 |
2.25 |
23.78 |
Breakfast meal |
526 |
93.5 |
2.57 |
23.68 |
Breakfast meal |
- also achieved by using
BETWEEN
and
AND
SELECT *
FROM data2
WHERE sugar BETWEEN 1 AND 5
AND category='Breakfast meal'
3 records
324 |
49.5 |
4.69 |
53.33 |
Breakfast meal |
598 |
212.4 |
2.25 |
23.78 |
Breakfast meal |
526 |
93.5 |
2.57 |
23.68 |
Breakfast meal |
filtering text data
in SQL
- use
LIKE
- the
LIKE
operator can be used in a WHERE clause to
search for a pattern in a column.
- To accomplish this, you use something called a wildcard as a
placeholder for some other values. There are two wildcards you can use
with LIKE:
The %
wildcard will match zero, one, or many characters
in text
The _
wildcard will match a single character
SELECT *
FROM data2
WHERE category LIKE 'Chicken%'
5 records
609 |
NA |
NA |
NA |
Chicken Breast |
671 |
481.88 |
1.34 |
51.90 |
Chicken |
683 |
147.24 |
0.94 |
54.00 |
Chicken |
419 |
1830.28 |
1.83 |
44.74 |
Chicken Breast |
117 |
NA |
NA |
NA |
Chicken Breast |
Aggregate functions
in SQL
SELECT AVG(sugar) AS avg_sugar,
MAX(sugar) AS max_sugar,
MIN(sugar) AS min_sugar,
VARIANCE(sugar) AS variance,
STDEV(sugar) AS stnd_deviation
FROM data2;
1 records
11.24864 |
86.97 |
0.07 |
426.5449 |
20.65296 |
Grouping and
aggregating in SQL
SELECT category,AVG(sugar) AS avg_sugar,
MAX(sugar) AS max_sugar,
MIN(sugar) AS min_sugar
FROM data2
GROUP BY category;
Displaying records 1 - 10
Beverages |
9.8000000 |
9.80 |
9.80 |
Breakfast meal |
5.1600000 |
11.13 |
2.25 |
Chicken |
1.1400000 |
1.34 |
0.94 |
Chicken Breast |
1.8300000 |
1.83 |
1.83 |
Dessert |
69.5700000 |
86.97 |
52.17 |
Meat |
13.5525000 |
30.06 |
5.21 |
One Dish Meal |
3.9066667 |
6.09 |
1.46 |
Pork |
4.6200000 |
4.62 |
4.62 |
Potato |
0.8100000 |
0.81 |
0.81 |
Vegetable |
0.8066667 |
1.96 |
0.07 |
we can order the data by a certain column using ORDER BY
clause
SELECT category,AVG(sugar) AS avg_sugar,
MAX(sugar) AS max_sugar,
MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar;
Displaying records 1 - 10
Vegetable |
0.8066667 |
1.96 |
0.07 |
Potato |
0.8100000 |
0.81 |
0.81 |
Chicken |
1.1400000 |
1.34 |
0.94 |
Chicken Breast |
1.8300000 |
1.83 |
1.83 |
One Dish Meal |
3.9066667 |
6.09 |
1.46 |
Pork |
4.6200000 |
4.62 |
4.62 |
Breakfast meal |
5.1600000 |
11.13 |
2.25 |
Beverages |
9.8000000 |
9.80 |
9.80 |
Meat |
13.5525000 |
30.06 |
5.21 |
Dessert |
69.5700000 |
86.97 |
52.17 |
the above output shows results arranged in ascending order of
avg_sugar
, if we needed to begin with the largest
descending we could useORDER BY...DESC
SELECT category,AVG(sugar) AS avg_sugar,
MAX(sugar) AS max_sugar,
MIN(sugar) AS min_sugar
FROM data2
GROUP BY category
ORDER BY avg_sugar DESC;
Displaying records 1 - 10
Dessert |
69.5700000 |
86.97 |
52.17 |
Meat |
13.5525000 |
30.06 |
5.21 |
Beverages |
9.8000000 |
9.80 |
9.80 |
Breakfast meal |
5.1600000 |
11.13 |
2.25 |
Pork |
4.6200000 |
4.62 |
4.62 |
One Dish Meal |
3.9066667 |
6.09 |
1.46 |
Chicken Breast |
1.8300000 |
1.83 |
1.83 |
Chicken |
1.1400000 |
1.34 |
0.94 |
Potato |
0.8100000 |
0.81 |
0.81 |
Vegetable |
0.8066667 |
1.96 |
0.07 |
Using `UPPER()
Sometimes you may want to change a column to all upper cases
SELECT recipe,category , UPPER(category) AS upper_cat
FROM data2;
Displaying records 1 - 10
624 |
Potato |
POTATO |
102 |
Vegetable |
VEGETABLE |
427 |
Dessert |
DESSERT |
324 |
Breakfast meal |
BREAKFAST MEAL |
912 |
Dessert |
DESSERT |
380 |
Meat |
MEAT |
609 |
Chicken Breast |
CHICKEN BREAST |
671 |
Chicken |
CHICKEN |
683 |
Chicken |
CHICKEN |
208 |
Beverages |
BEVERAGES |
Basic arithmetic
- In addition to using aggregate functions, you can perform basic
arithmetic with symbols like
+, -, *, and /
.
- lets create weird variables here
SELECT category,(sugar-protein) AS diff_sugar_protein
FROM data2;
Displaying records 1 - 10
Potato |
-8.26 |
Vegetable |
-38.73 |
Dessert |
82.48 |
Breakfast meal |
-48.64 |
Dessert |
NA |
Meat |
-9.46 |
Chicken Breast |
NA |
Chicken |
-50.56 |
Chicken |
-53.06 |
Beverages |
8.98 |
Take the
CASE
- using
CASE WHEN
to categorise or group data as
follows
SELECT
….
CASE WHEN
condition THEN
category
WHEN
condition THEN
category
- …………………………………..
ELSE
some_category
END AS
variable name
SELECT recipe , sugar,
CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
WHEN sugar IS NULL THEN 'missing'
ELSE '>200'
END AS sugar_category
FROM data2;
Displaying records 1 - 10
624 |
0.81 |
0-50 |
102 |
0.39 |
0-50 |
427 |
86.97 |
51-100 |
324 |
4.69 |
0-50 |
912 |
NA |
missing |
380 |
8.18 |
0-50 |
609 |
NA |
missing |
671 |
1.34 |
0-50 |
683 |
0.94 |
0-50 |
208 |
9.80 |
0-50 |
we can letter perform aggregations on the new column for instance get
the count per each category
SELECT COUNT(*) AS count_per_category,
CASE WHEN sugar BETWEEN 0 AND 50 THEN '0-50'
WHEN sugar BETWEEN 51 AND 100 THEN '51-100'
WHEN sugar BETWEEN 101 AND 200 THEN '101-200'
WHEN sugar IS NULL THEN 'missing'
ELSE '>200'
END AS sugar_category
FROM data2
GROUP BY sugar_category;
3 records
20 |
0-50 |
2 |
51-100 |
3 |
missing |
SQL subqueries
- use data1 and data2 databases for this task
- the following is just an example and calories in data2 and data1 are
the same so the result is the same
SELECT recipe,calories
FROM data2
WHERE calories >
(SELECT AVG(calories) FROM data1);
9 records
671 |
481.88 |
859 |
388.44 |
171 |
431.28 |
282 |
409.99 |
91 |
388.37 |
868 |
307.36 |
885 |
504.09 |
419 |
1830.28 |
639 |
321.95 |
Summarize group
statistics using subqueries
Sometimes you want to understand how a value varies across groups.
For example, how does the maximum value per group vary across
groups?
To find out, first summarize by group, and then compute summary
statistics of the group results. One way to do this is to compute group
values in a subquery, and then summarize the results of the
subquery.
what is the standard deviation across meal cateries in the maximum
amount of sugar? What about the mean, min, and max of the maximums as
well?
Start by writing a subquery to compute the max()
per
categoy; alias the subquery result as sugar_max. Then compute the
standard deviation of sugar_max with stddev()
. Compute the
min()
, max()
, and avg()
of
sugar_max too.
SELECT AVG(sugar_max) AS avg_sugar,
MAX(sugar_max) AS max_sugar,
MIN(sugar_max) AS min_sugar,
VARIANCE(sugar_max) AS variance,
STDEV(sugar_max) AS stnd_deviation
FROM (SELECT MAX(sugar) AS sugar_max
FROM data2
-- Compute max by...
GROUP BY category) AS max_results
1 records
15.461 |
86.97 |
0.81 |
707.2168 |
26.59355 |