Benefits to a database relative to traditional method like importing excel or csv files:

These two last things become much more an issue regarding GDPR regulations. A database is designed to allow some information to be stored separately from other pieces of information and then be joined together only by personnel that are able to access them, so the database has better user permission than Excel, or csv file in which once somebody has the file, they have all the data.

The downside of database is that it is more complex to set up and administer, to manage and maintain. It usually requires technical training to be able to do that.

There are lots of different SQL databases out there. You may have heard of databases like MySQL, PostgreSQL, MS Access, SQLite, etc

SQLite is simple, each database is just a single file that you can move and store in all of the usual ways that you can be will any other files, and as a result, it works basically anywhere with almost no setup. It’s great for relatively small projects up to several million rows that don’t require simultaneous data entry by multiple people

But if you already have SQL database set up, learning SQL to access these data is greatly useful. SQL is a language used to query or retrieve data from a relational database

SQL statements fall into 2 main categories: Data Definition Language (DDL) statements and Data Manipulation Language (DML) statements. DML statements like INSERT, SELECT, UPDATE, and DELETE are used to read and modify data. DDL statements (e.g., CREATE, ALTER, DROP) are used to define, change or delete database objects, such as tables, not the data within tables

1. Using DBI and odbc packages

1.1. Prepare database

#install.packages("odbc") # contain drivers to connect to a database
#install.packages("DBI") # contain functions to interact with database
library(odbc)
## Warning: package 'odbc' was built under R version 4.4.3
library(DBI)
## Warning: package 'DBI' was built under R version 4.4.3

As you don’t have already have database, now you create it to practice with SQL commands to query

  • Load the build-in data frames (sample data) that come with R packages
data("population")
data("who")

After running this, you have 2 data frames in your R environment

1.2. Establish connection profile to a database

The first step with using any database from R is then to connect to that database

con = dbConnect(drv = RSQLite::SQLite(), # specify driver to translate between R and the database
                 dbname = ":memory:")

dbConnect() function is used to create a connection to a database. RSQLite::SQLite() tells R to use SQLite, a lightweight database engine. dbname = ":memory:"means you are creating a temporary in-memory database, which means it only exists while R session is running; nothing is written to disk. con is database connection object.

  • Populate data to the created database
dbWriteTable(conn = con, # tell R which database connection to use
             name = "population", # give the table a name inside the database
             value = population # the data frame you want to save
             )
dbWriteTable(conn = con, 
             name = "who",
             value = who)

After this step, your database contains 2 tables: population and who

  • Remove the local data from the environment
rm(who, population) # remove objects from R environment

Now, the only copy of who and population data frames is in SQLite database, not in your R environment

# list tables in con database
dbListTables(con)
## [1] "population" "who"

1.3. Query

Preview a table in a database

# extract a table from a source
tbl(src = con,
    "who")

You can execute SQL query using 3 below methods, which will produce identical results. You can choose either based on your preference.

  • Method 1: Run a query with SQL chunk of code {sql connection= con, output.var = "Result_1"} instead of {r}. This method is for people who already know SQL. You have to use RMarkdown. You have to know this code of chunk is for SQL and other codes of chunk are for R, and they are not interchangeable
SELECT 
who.country, who.year, who.new_sp_m3544, population.population
FROM 
who
LEFT JOIN 
population 
  ON population.country = who.country 
    AND population.year = who.year
WHERE
who.country IN ('Brazil', 'Germany') AND
who.year >= 2000 AND
who.year <= 2010

This query select 3 columns from who table and 1 column from population table, join the 2 tables via {LEFT JOIN} if rows matched on 2 variables countryand year, and filter rows with {WHERE}statement

After this SQL query, we create a new object in the R environment called Result_1, which is a regular data frame

class(Result_1)
## [1] "data.frame"
Result_1
summary(Result_1)
##    country               year       new_sp_m3544     population       
##  Length:22          Min.   :2000   Min.   :  102   Min.   : 83017404  
##  Class :character   1st Qu.:2002   1st Qu.:  169   1st Qu.: 83608886  
##  Mode  :character   Median :2005   Median : 5184   Median :129176871  
##                     Mean   :2005   Mean   : 3383   Mean   :134569256  
##                     3rd Qu.:2008   3rd Qu.: 6034   3rd Qu.:185609373  
##                     Max.   :2010   Max.   :11906   Max.   :195210154  
##                                    NA's   :1
str(Result_1)
## 'data.frame':    22 obs. of  4 variables:
##  $ country     : chr  "Brazil" "Brazil" "Brazil" "Brazil" ...
##  $ year        : num  2000 2001 2002 2003 2004 ...
##  $ new_sp_m3544: num  11906 5184 6325 6034 6481 ...
##  $ population  : num  1.75e+08 1.77e+08 1.79e+08 1.82e+08 1.84e+08 ...
  • Method 2: Run using R code chunk. You need to know SQL but do not need to use RMarkdown because you use SQL code as text within R function
# create string for the query 
select <- "SELECT who.country, who.year, who.new_sp_m3544, population.population"
from <- "FROM who"
join <- "LEFT JOIN 
population 
  ON population.country = who.country 
    AND population.year = who.year"

where <- "WHERE
who.country IN ('Brazil', 'Germany') AND
who.year >= 2000 AND
who.year <= 2010"

query <- paste(select, from, join, where)
Result_2 <- dbGetQuery(conn = con, # specified database connection
               statement = query)
head(Result_2)
  • Method 3: is for those who don’t know about SQL but very familiar with {tidyverse}
library(dplyr)
Result_3 <- 
  tbl(src = con, "who") %>%  # extract table who from con database  
    filter(country %in% c("Brazil", "Germany"),
         year >= 2000,
         year <= 2010) %>%  
  dplyr::select(country, year, new_sp_m014) %>%
  left_join(y = tbl(src = con, "population"),
            by = c("country", "year")) %>%  
  collect() #this tells dplyr to execute and store the query, to get the result in R environment

Now, you have Result_2, Result_3 the same as Result_1

# combine method 2 and method 3
tbl(con,
    sql(query)) %>%  collect()

1.4. Put the data back into the database from R

For example, we want to bring the Result_1 to database con as a permanent table

#First, check the available tables in `con` 
dbListTables(conn = con)
## [1] "population" "who"
copy_to(con, # the database
        Result_1, # the table  we want to copy to the database
        temporary=F # to have Result_1 stored permanently in the con database
        )
dbListTables(con)
## [1] "Result_1"     "population"   "sqlite_stat1" "sqlite_stat4" "who"

Now, we have Result_1 table now, and 2 additional helper tables that you do not need to worry about

1.5. Other Data Manipulation language (DML)

  • Update all rows in who where year = 1980, setting new_sp_m014 = 1
dbExecute(conn = con,
  statement =  "UPDATE who
  SET new_sp_m014 = 1
  WHERE year = 1980"
)
## [1] 212
# check results
dbGetQuery(conn = con,
           "SELECT country, year, new_sp_m014
           FROM who
           WHERE year =1980
           LIMIT 10")
  • Insert new columns into existing tables (populate tables)
INSERT INTO who
(country, year)
VALUES("USA", 2025)

Check

tbl(src = con, "who")  %>% 
  collect() %>% 
  filter(country=="USA")

2. Using package sqldf

Package sqldf is another option to use SQL right inside R

#install.packages("sqldf")
library(sqldf)
## Warning: package 'sqldf' was built under R version 4.4.3
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 4.4.3
## Loading required package: proto
## Warning: package 'proto' was built under R version 4.4.3
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 4.4.3
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load data frame 
data("birthwt")

# description of the data frame
#?birthwt

Actually, the data frame is a table. In the concept of SQL, table is an entity within a relational database. Attributes are each tables’ columns which characterize tables

# explore data frame
head(birthwt)
str(birthwt)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Now, I will use SQL to query distinct combination values of age and smoke from birthwt and limit the output to just 5 rows

sqldf("SELECT DISTINCT age, smoke 
      FROM birthwt
      LIMIT 5");

Now let ’s practice some queries

sqldf("SELECT race,
      COUNT(*) as frequency 
      FROM birthwt
      GROUP BY race
      ORDER BY frequency DESC");
sqldf("SELECT race, ftv,
              COUNT(*) as frequency 
      FROM birthwt
      GROUP BY race, ftv
      ORDER BY race");

These commands can help uncover some patterns of the data, e.g., disparities in terms of race, etc