Introduction:
This R script below shows various approaches for interacting with a SQL database using R programming. It demonstrates how to connect to a database, conduct SQL queries, and obtain data using a variety of methods.
Objective:
The purpose of this script is to demonstrate three distinct approaches for querying data from a SQL database in R. These solutions include running SQL queries directly in RStudio, creating SQL statements within R functions, and using dplyr functions to query the dataset without using SQL scripts.
Benefits:
Comprehensive Coverage: This script covers a variety of methods for querying data from a SQL database, catering to users with varied degrees of SQL knowledge.
Flexibility: Users can select the technique that best meets their needs and competence with SQL and R programming.
Ease of Learning: By giving examples and explanations for each method, this script serves as a learning resource for people wishing to improve their data manipulation and database interface skills using R.
Practical Application: The methods demonstrated in this script can be used in real-world settings such as data analysis, reporting, and database management.
library(odbc) #Contains drivers to connect to a database
library(DBI) #Contains functions for interacting with the database
library(RSQL) #Generate and Process 'SQL' Queries in R
library(RSQLite) #Can create an in-memory SQL database
library(tidyverse) #Provides helpful functions and sample data to use in R
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("population")
population
## # A tibble: 4,060 × 3
## country year population
## <chr> <dbl> <dbl>
## 1 Afghanistan 1995 17586073
## 2 Afghanistan 1996 18415307
## 3 Afghanistan 1997 19021226
## 4 Afghanistan 1998 19496836
## 5 Afghanistan 1999 19987071
## 6 Afghanistan 2000 20595360
## 7 Afghanistan 2001 21347782
## 8 Afghanistan 2002 22202806
## 9 Afghanistan 2003 23116142
## 10 Afghanistan 2004 24018682
## # ℹ 4,050 more rows
data("who")
who
## # A tibble: 7,240 × 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… AF AFG 1980 NA NA NA NA
## 2 Afghani… AF AFG 1981 NA NA NA NA
## 3 Afghani… AF AFG 1982 NA NA NA NA
## 4 Afghani… AF AFG 1983 NA NA NA NA
## 5 Afghani… AF AFG 1984 NA NA NA NA
## 6 Afghani… AF AFG 1985 NA NA NA NA
## 7 Afghani… AF AFG 1986 NA NA NA NA
## 8 Afghani… AF AFG 1987 NA NA NA NA
## 9 Afghani… AF AFG 1988 NA NA NA NA
## 10 Afghani… AF AFG 1989 NA NA NA NA
## # ℹ 7,230 more rows
## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
## # new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
## # new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
## # new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …
#Build the placeholder (CREATE A DATABASE)
con <- dbConnect(drv = RSQLite::SQLite(),
dbname = ":memory:")
dbListTables(con)
## character(0)
#Load the population table
dbWriteTable(conn = con,
name = "population",
value = population)
#Load the who table
dbWriteTable(conn = con,
name = "who",
value = who)
dbListTables(con)
## [1] "population" "who"
#remove the local data from the environment
rm(who, population)
Viola now i am connected to a SQL database containing two tables
This tutorial provides three examples of executing a SQL query in R. The queries are identical so that you can see how the methods differ even when the output does not. Our database has two tables: ‘who’ and ‘population’. you can preview the data in the ‘who’ tables with this code:
tbl(src = con, #the source of the database connection profile
"who") #the name of the table to preview
## # Source: table<`who`> [?? x 60]
## # Database: sqlite 3.45.2 [:memory:]
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… AF AFG 1980 NA NA NA NA
## 2 Afghani… AF AFG 1981 NA NA NA NA
## 3 Afghani… AF AFG 1982 NA NA NA NA
## 4 Afghani… AF AFG 1983 NA NA NA NA
## 5 Afghani… AF AFG 1984 NA NA NA NA
## 6 Afghani… AF AFG 1985 NA NA NA NA
## 7 Afghani… AF AFG 1986 NA NA NA NA
## 8 Afghani… AF AFG 1987 NA NA NA NA
## 9 Afghani… AF AFG 1988 NA NA NA NA
## 10 Afghani… AF AFG 1989 NA NA NA NA
## # ℹ more rows
## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
## # new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
## # new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
## # new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …
The two tables contain different information. ‘population’ contains the number of people in each country for the years 1995 - 2013, and ‘who’ contains the number of tuberculosis cases across each country, year, age group, and diagnosis method. You can read more about the column meanings by typing ‘?who’ into the console.
What i want to do is to merge and store data from both tables into a new R object. The new data should contain four columns: - Country - year - new cases in 35 - 44 year old males - the Country’s total pop for that year.
I also want to filter the data according to a few criteria: - only cases in Brazil and Germany - Only cases between 2000 and 2010
Lets get started.
This method works best if you already know how to write SQL queries ‘and’ are working in an RMarkdown file. if both those conditions aren’t true, feel free to skip this method.
Create a new RMarkdown chunk, but change its heading to {sql, connection = con, output.var = “M1_results”} instead of the default {r}
Ran a SQL Query to ask for specific data in our database - in this case ‘con’ object and store in R
SELECT who.country, who.year, who.new_sp_m2534, population.population
FROM who
LEFT JOIN population ON population.country = who.country AND population.year = who.year
WHERE who.country IN ('Brazil', 'Germany') AND who.year >= 2000 AND who.year <= 2010
Rstudio can natively read SQL script when it’s in a markdown chunk
set to {sql}. output.var sets the name of the data
frame to store the result in, which we have called M1_results.
The text is the SQL script required to retrieve and filter the data from
both tables in our database.
You can now work with this data like you would with any other data frame. Let’s take a look at the first five rows:
head(M1_results)
## country year new_sp_m2534 population
## 1 Brazil 2000 11568 174504898
## 2 Brazil 2001 5536 176968205
## 3 Brazil 2002 5890 179393768
## 4 Brazil 2003 5709 181752951
## 5 Brazil 2004 6321 184010283
## 6 Brazil 2005 6119 186142403
You’re not always able to write a SQL statement in a dedicated chunk. For instance, you might be looping through some data and need to execute a SQL query based on a changing value. This means you will need to programmatically change the SQL query, which is best done with the R script.
Though not required for this example, I’ll demonstrate how to build the SQL query in multiple pieces and then join them together as a single piece of text. You can modify this method to work with loops or other R objects. We’ll use the same query shown in the first method.
The dbGetQuery function will process and send the SQL statement to the database, and then return data which we can store in a data frame called M1_results
select <- "SELECT who.country, who.year, who.new_sp_m2534, population.population"
from <- "FROM who"
ljoin <- "LEFT JOIN population ON population.country = who.country AND population.year = who.year"
where <- "WHERE who.country IN ('Brazil', 'Germany') AND who.year >= 2000 AND who.year <= 2010"
query <- paste(select, from, ljoin, where)
M2_results <- DBI::dbGetQuery(conn = con,
statement = query)
head(M2_results)
## country year new_sp_m2534 population
## 1 Brazil 2000 11568 174504898
## 2 Brazil 2001 5536 176968205
## 3 Brazil 2002 5890 179393768
## 4 Brazil 2003 5709 181752951
## 5 Brazil 2004 6321 184010283
## 6 Brazil 2005 6119 186142403
select <- "SELECT who.country, who.year, who.new_sp_m2534, population.population"
from <- "FROM who"
ljoin <- "LEFT JOIN population ON population.country = who.country AND population.year = who.year"
where <- "WHERE who.country IN ('Brazil', 'Germany') AND who.year >= 2000 AND who.year <= 2010"
query <- paste(select, from, ljoin, where)
M2_results <- DBI::dbGetQuery(conn = con,
statement = query)
head(M2_results)
## country year new_sp_m2534 population
## 1 Brazil 2000 11568 174504898
## 2 Brazil 2001 5536 176968205
## 3 Brazil 2002 5890 179393768
## 4 Brazil 2003 5709 181752951
## 5 Brazil 2004 6321 184010283
## 6 Brazil 2005 6119 186142403
This method is the easiest for those who are already comfortable working with dplyr functions but not yet familier with SQL scripting. The dplyr package was actually designed to replicate the most common SQL commands. so you may find it easy to learn the SQL-equivalent of your favorite dplyr function!
M3_results <-
tbl(src = con, "who") %>%
filter(country %in% c("Brazil", "Germany"),
year >= 2000,
year <= 2010) %>%
dplyr::select(country, year, new_sp_m2534) %>%
left_join(y = tbl(src = con, "population"),
by = c("country", "year")) %>%
collect() #this tells dplyr to execute and store the query