Set working directory

Introduction:

This R script below shows various approaches for interacting with a SQL database using R programming. It demonstrates how to connect to a database, conduct SQL queries, and obtain data using a variety of methods.

Objective:

The purpose of this script is to demonstrate three distinct approaches for querying data from a SQL database in R. These solutions include running SQL queries directly in RStudio, creating SQL statements within R functions, and using dplyr functions to query the dataset without using SQL scripts.

Benefits:

Load Libraries neccessary

library(odbc) #Contains drivers to connect to a database
library(DBI) #Contains functions for interacting with the database
library(RSQL) #Generate and Process 'SQL' Queries in R
library(RSQLite) #Can create an in-memory SQL database
library(tidyverse) #Provides helpful functions and sample data to use in R
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

PART 1: Connect & Create A Database (from tidyverse local data/‘memory’)

data("population")
population
## # A tibble: 4,060 × 3
##    country      year population
##    <chr>       <dbl>      <dbl>
##  1 Afghanistan  1995   17586073
##  2 Afghanistan  1996   18415307
##  3 Afghanistan  1997   19021226
##  4 Afghanistan  1998   19496836
##  5 Afghanistan  1999   19987071
##  6 Afghanistan  2000   20595360
##  7 Afghanistan  2001   21347782
##  8 Afghanistan  2002   22202806
##  9 Afghanistan  2003   23116142
## 10 Afghanistan  2004   24018682
## # ℹ 4,050 more rows
data("who")
who
## # A tibble: 7,240 × 60
##    country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##    <chr>    <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
##  1 Afghani… AF    AFG    1980          NA           NA           NA           NA
##  2 Afghani… AF    AFG    1981          NA           NA           NA           NA
##  3 Afghani… AF    AFG    1982          NA           NA           NA           NA
##  4 Afghani… AF    AFG    1983          NA           NA           NA           NA
##  5 Afghani… AF    AFG    1984          NA           NA           NA           NA
##  6 Afghani… AF    AFG    1985          NA           NA           NA           NA
##  7 Afghani… AF    AFG    1986          NA           NA           NA           NA
##  8 Afghani… AF    AFG    1987          NA           NA           NA           NA
##  9 Afghani… AF    AFG    1988          NA           NA           NA           NA
## 10 Afghani… AF    AFG    1989          NA           NA           NA           NA
## # ℹ 7,230 more rows
## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
## #   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
## #   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
## #   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …
#Build the placeholder (CREATE A DATABASE)
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = ":memory:")

dbListTables(con)
## character(0)
#Load the population table
dbWriteTable(conn = con, 
             name = "population",
             value = population)

#Load the who table
dbWriteTable(conn = con, 
             name = "who",
             value = who)

dbListTables(con)
## [1] "population" "who"
#remove the local data from the environment
rm(who, population)

Viola now i am connected to a SQL database containing two tables

PART 2: Databasing in R

This tutorial provides three examples of executing a SQL query in R. The queries are identical so that you can see how the methods differ even when the output does not. Our database has two tables: ‘who’ and ‘population’. you can preview the data in the ‘who’ tables with this code:

tbl(src = con, #the source of the database connection profile
    "who") #the name of the table to preview
## # Source:   table<`who`> [?? x 60]
## # Database: sqlite 3.45.2 [:memory:]
##    country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##    <chr>    <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
##  1 Afghani… AF    AFG    1980          NA           NA           NA           NA
##  2 Afghani… AF    AFG    1981          NA           NA           NA           NA
##  3 Afghani… AF    AFG    1982          NA           NA           NA           NA
##  4 Afghani… AF    AFG    1983          NA           NA           NA           NA
##  5 Afghani… AF    AFG    1984          NA           NA           NA           NA
##  6 Afghani… AF    AFG    1985          NA           NA           NA           NA
##  7 Afghani… AF    AFG    1986          NA           NA           NA           NA
##  8 Afghani… AF    AFG    1987          NA           NA           NA           NA
##  9 Afghani… AF    AFG    1988          NA           NA           NA           NA
## 10 Afghani… AF    AFG    1989          NA           NA           NA           NA
## # ℹ more rows
## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
## #   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
## #   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
## #   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …

The two tables contain different information. ‘population’ contains the number of people in each country for the years 1995 - 2013, and ‘who’ contains the number of tuberculosis cases across each country, year, age group, and diagnosis method. You can read more about the column meanings by typing ‘?who’ into the console.

What i want to do is to merge and store data from both tables into a new R object. The new data should contain four columns: - Country - year - new cases in 35 - 44 year old males - the Country’s total pop for that year.

I also want to filter the data according to a few criteria: - only cases in Brazil and Germany - Only cases between 2000 and 2010

Lets get started.

METHOD 1: Write a SQL Statement and execute it in Rstudio

SELECT who.country, who.year, who.new_sp_m2534, population.population
FROM who
LEFT JOIN population ON population.country = who.country AND population.year = who.year
WHERE who.country IN ('Brazil', 'Germany') AND who.year >= 2000 AND who.year <= 2010

Rstudio can natively read SQL script when it’s in a markdown chunk set to {sql}. output.var sets the name of the data frame to store the result in, which we have called M1_results. The text is the SQL script required to retrieve and filter the data from both tables in our database.

You can now work with this data like you would with any other data frame. Let’s take a look at the first five rows:

head(M1_results)
##   country year new_sp_m2534 population
## 1  Brazil 2000        11568  174504898
## 2  Brazil 2001         5536  176968205
## 3  Brazil 2002         5890  179393768
## 4  Brazil 2003         5709  181752951
## 5  Brazil 2004         6321  184010283
## 6  Brazil 2005         6119  186142403

Method 2: Write a SQL statement in an R function

You’re not always able to write a SQL statement in a dedicated chunk. For instance, you might be looping through some data and need to execute a SQL query based on a changing value. This means you will need to programmatically change the SQL query, which is best done with the R script.

Though not required for this example, I’ll demonstrate how to build the SQL query in multiple pieces and then join them together as a single piece of text. You can modify this method to work with loops or other R objects. We’ll use the same query shown in the first method.

The dbGetQuery function will process and send the SQL statement to the database, and then return data which we can store in a data frame called M1_results

select <- "SELECT who.country, who.year, who.new_sp_m2534, population.population"
from <- "FROM who"
ljoin <- "LEFT JOIN population ON population.country = who.country AND population.year = who.year"
where <- "WHERE who.country IN ('Brazil', 'Germany') AND who.year >= 2000 AND who.year <= 2010"

query <- paste(select, from, ljoin, where)

M2_results <- DBI::dbGetQuery(conn = con,
                              statement = query)

head(M2_results)
##   country year new_sp_m2534 population
## 1  Brazil 2000        11568  174504898
## 2  Brazil 2001         5536  176968205
## 3  Brazil 2002         5890  179393768
## 4  Brazil 2003         5709  181752951
## 5  Brazil 2004         6321  184010283
## 6  Brazil 2005         6119  186142403
select <- "SELECT who.country, who.year, who.new_sp_m2534, population.population"
from <- "FROM who"
ljoin <- "LEFT JOIN population ON population.country = who.country AND population.year = who.year"
where <- "WHERE who.country IN ('Brazil', 'Germany') AND who.year >= 2000 AND who.year <= 2010"

query <- paste(select, from, ljoin, where)

M2_results <- DBI::dbGetQuery(conn = con,
                              statement = query)

head(M2_results)
##   country year new_sp_m2534 population
## 1  Brazil 2000        11568  174504898
## 2  Brazil 2001         5536  176968205
## 3  Brazil 2002         5890  179393768
## 4  Brazil 2003         5709  181752951
## 5  Brazil 2004         6321  184010283
## 6  Brazil 2005         6119  186142403

Method 3: Use dplyr to query the dataset without any SQL

This method is the easiest for those who are already comfortable working with dplyr functions but not yet familier with SQL scripting. The dplyr package was actually designed to replicate the most common SQL commands. so you may find it easy to learn the SQL-equivalent of your favorite dplyr function!

M3_results <- 
  tbl(src = con, "who") %>% 
  filter(country %in% c("Brazil", "Germany"),
         year >= 2000,
         year <= 2010) %>% 
  dplyr::select(country, year, new_sp_m2534) %>% 
  left_join(y = tbl(src = con, "population"),
            by = c("country", "year")) %>% 
  collect() #this tells dplyr to execute and store the query