In this tutorial i will show you it’s easy to learn R’s dplyr together with SQL. I show some of the most similar ways of handling data in R and SQL

Introduction

learning to code in more than one software at once is no easy pizzy but it’s much easy if the two softwares have some form of resemblence in them . In this tutorial i will show you how to connect to a database using Rmarkdown and Rstudio as well as query this database using SQL (Within an rmarkdown document)

setup

  • load packages required (assuming you already installed them)
library(tidyverse)
library(odbc)
library(DBI)
library(RSQLite)

data for the tutorial

Next up we load our data for the tutorial , I have used a dataset from a project i did from datacamp to be certified as data scientist associate . I then sample only a few rows and create two dataframes which i will turn into tables later on.

## read in the dataset
df <- readr::read_csv("recipe_site_traffic_2212.csv")

## sample 100 observations and select first 3 variables
set.seed(1123)
data1<- df |>
  sample_n(size=25) |> 
  select(1,2,3,6)

## sample 100 observations and select subsequent 3 variables(including the first(ID))
## 
set.seed(1123)
data2<- df |>
  sample_n(size=25) |> 
  select(1,2,4,5,6)

Create a database

con<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con,data1)
copy_to(con,data2)

Sql queries in R and dplyr code

  • A query is a request for data from a database table (or combination of tables).
  • Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.

Selecting all data in SQL

  • In SQL, you can select data from a table using a SELECT statement. For example, the following query selects the name column from the people table: SELECT name FROM people;
  • you may want to select all columns from a table use * rather than typing all the columns
dbGetQuery(con,'SELECT * FROM data2')

Selecting a subset of variables

dbGetQuery(con,'SELECT recipe,sugar FROM data2')

Selecting all data in R

data2 |> select(everything())

selecting a subset of variables in R

data2 |> 
  select(recipe,sugar)

filtering using WHERE in SQL

  • WHERE is a filtering clause

  • In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:

  • = equal

  • <> not equal

  • < less than

  • > greater than

  • <= less than or equal to

  • >= greater than or equal to

  • You can build up your WHERE queries by combining multiple conditions with the AND keyword.

dbGetQuery(con,"SELECT * 
                FROM data2 
                WHERE category='Potato'")

filtering in R

data2 |> 
  filter(category=="Potato")

filtering using IN and WHERE in SQL

dbGetQuery(con,"SELECT * 
                FROM data2 
                WHERE category IN ('Chicken','Chicken Breast')")

filtering using %in%

data2 |> 
  filter(category %in% c("Chicken","Chicken Breast"))

COUNT and AS alias in SQL

  • The COUNT statement lets you count then returning the number of rows in one or more columns.
  • we use AS to rename default name
  • COUNT(*) tells you how many rows are in a table
dbGetQuery(con,"SELECT COUNT(*) AS n FROM data2")

Counting in dplyr

data2 |> count()

count filters in SQL

dbGetQuery(con,"SELECT COUNT(category) AS n_potato
                FROM data2 
                WHERE category IN ('Chicken Breast','Chicken')")

counting filters in dplyr

data2 |> 
  filter(category %in% c("Chicken","Chicken Breast")) |> 
  summarise(n_potato=n())

Updating a database using dbExecute

  • categories before
data2 |> 
  janitor::tabyl(category)

Updating a database

  • we need to change a category called Breakfast to Breakfast meal
  • we shall use REPLACE
dbExecute(con,"UPDATE data2 
               SET category= REPLACE(category,'Breakfast','Breakfast meal')
               WHERE category='Breakfast'")
#> [1] 4

Querying to check results

dbGetQuery(con,"SELECT category, COUNT(category) as n_per_category
                FROM data2
                GROUP BY category")
  • we see that Breakfast has updated to Breakfast meal

Updating and grouping in R

  • We use stringr::str_replace to replace elements in a string variable
data2 |> 
  mutate(category=stringr::str_replace(category,"Breakfast","Breakfast meal")) |> 
  group_by(category) |> 
  summarise(n_per_category=n())

Missing data in SQL

+n SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL + Use IS NULL AND IS NOT NULL

dbGetQuery(con,"SELECT * 
                FROM data2 
                WHERE Protein IS NULL")
dbGetQuery(con,"SELECT * 
                FROM data2 
                WHERE Protein IS NOT NULL")

missing values in R

  • use is.na() and !is.na
data2 |> 
  filter(is.na(protein))
data2 |> 
  filter(!is.na(protein))

How many categories do we have

select distinct items in SQL

  • Often your results will include many duplicate values. If you want to select all the unique values from a column, you can use the DISTINCT keyword.
dbGetQuery(con,"SELECT COUNT(DISTINCT category) AS unique_categories 
                FROM data2")

select distinct in R

  • use distinct in R
data2 |> 
  distinct(category) |> 
  summarise(unique_categories=n())

Advanced filters

filtering in SQL

dbGetQuery(con,"SELECT * 
                FROM data2 WHERE sugar > 1 AND sugar < 5 
                AND category='Breakfast meal'")
  • also achieved by using BETWEEN and AND
dbGetQuery(con,"SELECT * 
                FROM data2 
                WHERE sugar BETWEEN 1 AND 5 
                AND category='Breakfast meal'")

In R

data2 |> 
  filter(between(sugar,1,5) & category=="Breakfast")

filtering text data in SQL

  • use LIKE
  • the LIKE operator can be used in a WHERE clause to search for a pattern in a column.
  • To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text

The _ wildcard will match a single character

dbGetQuery(con,"SELECT * 
                FROM data2 
                WHERE category LIKE 'Chicken%'")

Filter text in R

  • use str_like
data2 |> 
  filter(str_like(category,"Chicken%"))

Aggregate functions in SQL

sqldf::sqldf("SELECT AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
                FROM data2;") 

Aggregate functions in R

data2 |> 
  summarise(avg_sugar=mean(sugar,na.rm=T),
            max_sugar= max(sugar,na.rm=T),
            min_sugar= min(sugar,na.rm=T))

Grouping and aggregating in SQL

sqldf::sqldf("SELECT category,AVG(sugar) AS avg_sugar,
                       MAX(sugar) AS max_sugar,
                       MIN(sugar) AS min_sugar
                FROM data2
                GROUP BY category;") 

group and Aggregate functions in R

data2 |>
  group_by(category) |> 
  summarise(avg_sugar=mean(sugar,na.rm=T),
            max_sugar= max(sugar,na.rm=T),
            min_sugar= min(sugar,na.rm=T))

Basic arithmetic

  • In addition to using aggregate functions, you can perform basic arithmetic with symbols like _+, -, *, and /_.
  • lets create weird variables here
sqldf::sqldf("SELECT category,(sugar-protein) AS diff_sugar_protein 
                FROM data2")

basic arithmetic in R using mutate()

data2 |> 
  mutate(diff_sugar_protein=sugar-protein) |> 
  select(category,diff_sugar_protein)

SQL subqueries

  • use data1 and data2 databases for this task
  • the following is just an example and calories in data2 and data1 are the same so the result is the same
sqldf::sqldf("SELECT recipe,calories
                FROM data2 
                WHERE calories > 
                          (SELECT AVG(calories) FROM data1)")
  • the above is the same as follows in R
data2 |> 
  select(recipe,sugar,calories) |> 
  filter(calories>(data1 |> 
           summarise(mean_calories=mean(calories,na.rm=TRUE)) |> 
             pull(mean_calories)))

Joining tables in SQL and R

  • Inner Join - only keep observations found in both x and y
  • Left Join - keep all observations in x
  • Right Join - keep all observations in y
  • Full Join - keep any observations in x or y

for this dataset we shall use INNER JOIN Alone

Inner Join

When talking about inner joins, we are only going to keep an observation if it is found in all of the tables we’re combining.

sqldf::sqldf("SELECT data2.category,data2.recipe,data2.calories,
                       carbohydrate,protein,sugar 
                FROM data2
                INNER JOIN data1 ON data2.recipe=data1.recipe")

Inner join in R

dplyr::inner_join(data1,data2,by=join_by(recipe)) |> 
  select(recipe,carbohydrate,sugar,protein,ends_with(".x"))

For full,right and left joins i am using some dummy datasets that i created

setting up the data

# read in example csv files (join_df1 and join_df2)
join_df1 <- read.csv("join_df1.csv") |> 
  rename(A=`ï..A`)
join_df2 <- read.csv("join_df2.csv")|> 
  rename(A=`ï..A`)

join<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(join,join_df1)
copy_to(join,join_df2)

look at the databases now

dbGetQuery(join,"SELECT * FROM join_df1")
dbGetQuery(join,"SELECT * FROM join_df2")

left join in SQL

Left Join

For a left join, all rows in the first table specified will be included in the output. Any row in the second table that is not in the first table will not be included.

sqldf::sqldf("SELECT * 
                 FROM join_df1
                 LEFT JOIN join_df2
                 USING(A)")

RIGHT JOIN IN SQL

Right Join is similar to what we just discussed; however, in the output from a right join, all rows in the final table specified are included in the output. NAs will be included for any observations found in the last specified table but not in the other tables.

sqldf::sqldf("SELECT * 
                 FROM join_df1
                 RIGHT JOIN join_df2
                 USING(A)")

left join df1 and df2

dplyr::left_join(join_df1, join_df2,by = join_by(A))

right join df1 and df2

dplyr::right_join(join_df1, join_df2,by = join_by(A))

full join in SQL

Finally, a full join will take every observation from every table and include it in the output.

sqldf::sqldf("SELECT * FROM join_df1 
                 FULL JOIN join_df2
                 USING(A)")

full join df1 and df2 specifying which column is the same

dplyr::full_join(join_df1, join_df2,by = join_by(A))

References

  • Tips on Getting Started with databases using R (from Rstudio)
  • Using SQL in Rstudio from Irene Steves - a discussion of the Rubymine IDE and working with SQL in RSTUDIO.
  • CRAN