Benefits of a database relative to traditional methods like importing Excel or CSV files:
Efficient storage and retrieval: a relational database is designed to reduce data redundancy, which means you can store large amounts of information in a much smaller size than an Excel or CSV file
Unlimited rows/records: a database is limited only by the hard drive capacity of the device it is stored on, whereas Excel allows only about a million rows
Always up-to-date: if you connect directly to the data source as data get edited, added, or deleted, you don’t need to ask somebody for the updated file. You simply rerun your query and immediately access the most current information
Allows real-time collaboration: Multiple users can work with a database simultaneously from different locations
Enhanced data privacy
Authentication protocols
These last two points become much more of an issue under GDPR regulations. A database is designed so that some information can be stored separately from other pieces of information and joined together only by personnel who are authorized to access them, so a database has better user permissions than an Excel or CSV file, where once somebody has the file, they have all the data.
The downside of a database is that it is more complex to set up, administer, manage, and maintain. It usually requires technical training to do that.
There are lots of different SQL databases out there. You may have heard of databases like MySQL, PostgreSQL, MS Access, SQLite, etc
SQLite is simple: each database is just a single file that you can move and store in all of the usual ways you would any other file, and as a result it works basically anywhere with almost no setup. It’s great for relatively small projects, up to several million rows, that don’t require simultaneous data entry by multiple people
But if you already have an SQL database set up, learning SQL to access those data is very useful. SQL is a language used to query, or retrieve, data from a relational database
SQL statements fall into 2 main categories: Data Definition Language (DDL) statements and Data Manipulation Language (DML) statements. DML statements like INSERT, SELECT, UPDATE, and DELETE are used to read and modify data. DDL statements (e.g., CREATE, ALTER, DROP) are used to define, change or delete database objects, such as tables, not the data within tables
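As a minimal sketch of the distinction, using the DBI and RSQLite packages introduced below and a throwaway in-memory database (the con_demo connection, the patients table, and its columns are made up for illustration), a DDL statement creates the table itself while DML statements work on the rows inside it:
library(DBI)
con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# DDL: define a database object (a table), not the data inside it
dbExecute(con_demo, "CREATE TABLE patients (id INTEGER, name TEXT)")
# DML: modify and read the rows stored in that table
dbExecute(con_demo, "INSERT INTO patients (id, name) VALUES (1, 'Ana')")
dbGetQuery(con_demo, "SELECT * FROM patients")
dbDisconnect(con_demo)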
DBI and odbc packages
#install.packages("odbc") # contains drivers to connect to a database
#install.packages("DBI") # contains functions to interact with databases
library(odbc)
## Warning: package 'odbc' was built under R version 4.4.3
library(DBI)
## Warning: package 'DBI' was built under R version 4.4.3
As you don’t already have a database, you will now create one to practice SQL queries
data("population")
data("who")
After running this, you have 2 data frames in your R environment
The first step in using any database from R is to connect to that database
con = dbConnect(drv = RSQLite::SQLite(), # specify driver to translate between R and the database
dbname = ":memory:")
The dbConnect() function is used to create a connection to a database. RSQLite::SQLite() tells R to use SQLite, a lightweight database engine. dbname = ":memory:" means you are creating a temporary in-memory database, which only exists while the R session is running; nothing is written to disk. con is the database connection object.
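If you instead want a database that persists between sessions, you can point dbname at a file path; this is a sketch with a hypothetical file name, and the whole database then lives in that single file on disk:
con_file <- dbConnect(drv = RSQLite::SQLite(),
                      dbname = "my_database.sqlite") # hypothetical file name; the file is created if it does not exist
# ... work with con_file just like con ...
dbDisconnect(con_file) # close the connection when you are finished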
dbWriteTable(conn = con, # tell R which database connection to use
name = "population", # give the table a name inside the database
value = population # the data frame you want to save
)
dbWriteTable(conn = con,
name = "who",
value = who)
After this step, your database contains 2 tables:
population and who
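Note that running dbWriteTable() again for a table name that already exists raises an error by default; as a sketch, DBI's overwrite and append arguments cover the two common cases:
#dbWriteTable(conn = con, name = "who", value = who, overwrite = TRUE) # replace the existing table
#dbWriteTable(conn = con, name = "who", value = who, append = TRUE) # add the rows to the existing table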
rm(who, population) # remove objects from R environment
Now, the only copies of the who and population
data frames are in the SQLite database, not in your R environment
# list tables in con database
dbListTables(con)
## [1] "population" "who"
Preview a table in a database
# extract a table from a source; tbl() comes from dplyr (with the dbplyr backend for databases)
library(dplyr)
tbl(src = con,
    "who")
You can execute an SQL query using the 3 methods below, which produce identical results. Choose whichever you prefer.
The first method uses an SQL code chunk: write the chunk header as {sql connection = con, output.var = "Result_1"} instead of {r}. This method is for people who already know SQL. You have to use R Markdown, and you have to remember that this chunk runs SQL while the other chunks run R; they are not interchangeable.
SELECT
who.country, who.year, who.new_sp_m3544, population.population
FROM
who
LEFT JOIN
population
ON population.country = who.country
AND population.year = who.year
WHERE
who.country IN ('Brazil', 'Germany') AND
who.year >= 2000 AND
who.year <= 2010
This query selects 3 columns from the who table and 1 column
from the population table, joins the 2 tables via
LEFT JOIN on the 2 matching variables
country and year, and filters rows with the
WHERE statement
After this SQL query, we create a new object in the R environment called Result_1, which is a regular data frame
class(Result_1)
## [1] "data.frame"
Result_1
summary(Result_1)
## country year new_sp_m3544 population
## Length:22 Min. :2000 Min. : 102 Min. : 83017404
## Class :character 1st Qu.:2002 1st Qu.: 169 1st Qu.: 83608886
## Mode :character Median :2005 Median : 5184 Median :129176871
## Mean :2005 Mean : 3383 Mean :134569256
## 3rd Qu.:2008 3rd Qu.: 6034 3rd Qu.:185609373
## Max. :2010 Max. :11906 Max. :195210154
## NA's :1
str(Result_1)
## 'data.frame': 22 obs. of 4 variables:
## $ country : chr "Brazil" "Brazil" "Brazil" "Brazil" ...
## $ year : num 2000 2001 2002 2003 2004 ...
## $ new_sp_m3544: num 11906 5184 6325 6034 6481 ...
## $ population : num 1.75e+08 1.77e+08 1.79e+08 1.82e+08 1.84e+08 ...
# create string for the query
select <- "SELECT who.country, who.year, who.new_sp_m3544, population.population"
from <- "FROM who"
join <- "LEFT JOIN
population
ON population.country = who.country
AND population.year = who.year"
where <- "WHERE
who.country IN ('Brazil', 'Germany') AND
who.year >= 2000 AND
who.year <= 2010"
query <- paste(select, from, join, where)
Result_2 <- dbGetQuery(conn = con, # specified database connection
statement = query)
head(Result_2)
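Rather than pasting values into the query string, dbGetQuery() also accepts a params argument so values are bound separately, which is safer when they come from user input; a minimal sketch against the same con connection (Result_2b is just an illustrative object name):
query_param <- "SELECT who.country, who.year, who.new_sp_m3544, population.population
                FROM who
                LEFT JOIN population
                  ON population.country = who.country AND population.year = who.year
                WHERE who.country = ? AND who.year BETWEEN ? AND ?"
Result_2b <- dbGetQuery(conn = con,
                        statement = query_param,
                        params = list("Brazil", 2000, 2010))
head(Result_2b)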
{tidyverse}
library(dplyr)
Result_3 <-
tbl(src = con, "who") %>% # extract table who from con database
filter(country %in% c("Brazil", "Germany"),
year >= 2000,
year <= 2010) %>%
dplyr::select(country, year, new_sp_m3544) %>%
left_join(y = tbl(src = con, "population"),
by = c("country", "year")) %>%
collect() #this tells dplyr to execute and store the query, to get the result in R environment
Now you have Result_2 and Result_3, the same as Result_1 (Result_3 is returned as a tibble rather than a base data frame, but the content is the same)
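Before collect(), dplyr only builds the query lazily; if you want to see the SQL it generates, you can end the pipeline with show_query() (from the dbplyr backend) instead of collect():
tbl(src = con, "who") %>%
  filter(country %in% c("Brazil", "Germany"),
         year >= 2000,
         year <= 2010) %>%
  dplyr::select(country, year, new_sp_m3544) %>%
  left_join(y = tbl(src = con, "population"),
            by = c("country", "year")) %>%
  show_query() # prints the translated SQL without pulling any data into R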
# combine method 2 and method 3
tbl(con,
sql(query)) %>% collect()
For example, suppose we want to bring Result_1 into the
con database as a permanent table
#First, check the available tables in `con`
dbListTables(conn = con)
## [1] "population" "who"
copy_to(con, # the database
Result_1, # the table we want to copy to the database
temporary = FALSE # to have Result_1 stored permanently in the con database
)
dbListTables(con)
## [1] "Result_1" "population" "sqlite_stat1" "sqlite_stat4" "who"
Now we have the Result_1 table, plus 2 additional SQLite helper
tables (sqlite_stat1, sqlite_stat4) that you do not need to worry about
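If you later decide you no longer need that table, DBI can drop it again (shown commented out so Result_1 stays in place for now):
#dbRemoveTable(conn = con, name = "Result_1") # delete the Result_1 table from the con database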
You can also modify the data in the database with dbExecute(); for example, update the
who table where year = 1980, setting new_sp_m014 = 1
dbExecute(conn = con,
          statement = "UPDATE who
                       SET new_sp_m014 = 1
                       WHERE year = 1980"
          ) # dbExecute() returns the number of rows affected
## [1] 212
# check results
dbGetQuery(conn = con,
"SELECT country, year, new_sp_m014
FROM who
WHERE year =1980
LIMIT 10")
Similarly, insert a new row into the who table:
dbExecute(conn = con,
          statement = "INSERT INTO who
                         (country, year)
                       VALUES ('USA', 2025)")
Check
tbl(src = con, "who") %>%
collect() %>%
filter(country=="USA")
sqldf
Package sqldf is another option to use SQL right inside R
#install.packages("sqldf")
library(sqldf)
## Warning: package 'sqldf' was built under R version 4.4.3
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 4.4.3
## Loading required package: proto
## Warning: package 'proto' was built under R version 4.4.3
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 4.4.3
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load data frame
data("birthwt")
# description of the data frame
#?birthwt
Actually, a data frame is a table. In SQL terms, a table is an entity within a relational database, and attributes are the table's columns, which characterize the table
# explore data frame
head(birthwt)
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
Now, I will use SQL to query the distinct combinations of
age and smoke from birthwt and
limit the output to just 5 rows
sqldf("SELECT DISTINCT age, smoke
FROM birthwt
LIMIT 5");
Now let’s practice some queries
sqldf("SELECT race,
COUNT(*) as frequency
FROM birthwt
GROUP BY race
ORDER BY frequency DESC");
Now count the frequency of each combination of race and ftv:
sqldf("SELECT race, ftv,
COUNT(*) as frequency
FROM birthwt
GROUP BY race, ftv
ORDER BY race");
These commands can help uncover patterns in the data, e.g., disparities across race groups.
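For instance, a quick sketch comparing mean birth weight between smokers and non-smokers (in birthwt, bwt is birth weight in grams and smoke is smoking status during pregnancy):
sqldf("SELECT smoke,
              COUNT(*) AS n,
              AVG(bwt) AS mean_bwt
       FROM birthwt
       GROUP BY smoke")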