- You already know SQL
- A normalized database structure is appropriate for the data
- You want to share data with another application, such as one written in PHP
- You want to reduce memory requirements
- You are using an R package or script that is database-aware
For these reasons, SQLite is considered the most widely deployed database engine. Cellphone and other mobile device applications commonly use SQLite.
- DBI: basic functions to connect to all types of external databases
- RSQLite: functions to embed SQLite in R
- RSQLite.extfuns: includes some SQLite extensions
- sqldf: like using SQL queries? Use SQL syntax with R dataframes.

library(DBI)
library(RSQLite)
# This connection creates an empty database if it does not exist
db <- dbConnect(SQLite(), dbname = "./Test.sqlite")
dbSendQuery(conn = db, "CREATE TABLE School(SchID INTEGER, Location TEXT,
            Authority TEXT, SchSize TEXT)")
## <SQLiteResult: DBI RES (843, 0, 1)>
dbSendQuery(conn = db, "INSERT INTO School values (1,'urban', 'state', 'medium')")
## <SQLiteResult: DBI RES (843, 0, 2)>
dbSendQuery(conn = db, "INSERT INTO School values (2,'urban', 'independent', 'large')")
## <SQLiteResult: DBI RES (843, 0, 3)>
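dbSendQuery returns a result object; with current DBI it is good practice to fetch from it and then clear it when you are done. A minimal sketch, not part of the original code:

# Fetch a quick row count, then release the result set
res <- dbSendQuery(conn = db, "SELECT count(*) AS n FROM School")
dbFetch(res)
dbClearResult(res)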
If your data already resides in an R dataframe...
# Load your R dataframe
load(file = "~/RProject/sqlite/DATA/Class.rda")
# Push to SQLite
dbWriteTable(conn = db, name = "Class", value = Class, row.names = FALSE)
## [1] TRUE
dbWriteTable(conn = db, name = "Student", value = "~/RProject/sqlite/DATA/student.csv",
row.names = FALSE, header = TRUE)
## [1] TRUE
dbListFields(db, "Student")
## [1] "studID" "classID" "Gender" "Test1" "Test2"
Check out the functions in the XLConnect package, which could read this data into SQLite from a three-worksheet Excel file.
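A minimal sketch of that idea, assuming a workbook schools.xlsx whose three worksheets are named after the tables (the file and sheet names are hypothetical):

library(XLConnect)
wb <- loadWorkbook("schools.xlsx")
# Write each worksheet to a SQLite table of the same name
for (sheet in getSheets(wb)) {
    dbWriteTable(conn = db, name = sheet, value = readWorksheet(wb, sheet),
                 row.names = FALSE)
}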
# For tables
dbListTables(db)
## [1] "Class" "School" "Student"
# For fields in a table
dbListFields(db, "School")
## [1] "SchID" "Location" "Authority" "SchSize"
# With dbReadTable
dbReadTable(db, "School")
## SchID Location Authority SchSize
## 1 1 urban state medium
## 2 2 urban independent large
# Or query
dbGetQuery(db, "SELECT * from School")
## SchID Location Authority SchSize
## 1 1 urban state medium
## 2 2 urban independent large
# Close the connection to the database
dbDisconnect(db)
## [1] TRUE
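As an aside, the sqldf package mentioned earlier skips the explicit connection entirely and lets you query an ordinary R dataframe with SQL syntax; it manages a temporary SQLite database behind the scenes. A quick illustration with the built-in iris data:

library(sqldf)
# SQL against a dataframe in your workspace
sqldf("SELECT Species, count(*) AS n FROM iris GROUP BY Species")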
An add-on allows you to view SQLite files, similar to phpMyAdmin.
If the file is huge and you want to manipulate it outside of R, you can install SQLite itself.
The RQDA package leverages the capabilities of SQLite to create a database containing the classification codes and metadata required for coding documents in qualitative analysis.
db <- dbConnect(SQLite(), dbname = "./DATA/q64.rqda")
dbListTables(db)
myCodes <- dbGetQuery(conn = db,
    "SELECT source.name AS filename, source.id AS fid, C.cid AS cid, C.code AS code
     FROM source
     INNER JOIN (SELECT freecode.name AS code, coding.cid, coding.fid
                 FROM coding INNER JOIN freecode
                 WHERE freecode.id = coding.cid AND coding.status = 1) C
     WHERE source.id = C.fid AND source.status = 1")
# Reshape into file id (or subject) level. Could also merge in subject
# attributes and related quantitative information from other data sources.
myPersonLevelFile <- reshape(myCodes, v.names = "code", timevar = "cid", idvar = "fid",
direction = "wide")
# Count the number of codes per person, added as a column to the dataframe
myPersonLevelFile$nCodes <- rowSums(!is.na(myPersonLevelFile[3:5]))
Using SQLite can reduce memory requirements when you have a very large file but typically need only a small number of variables at a time.
- Use queries to pull in only the variables you need (a sketch follows this list)
- Perform recodes on the disk file where possible, using SQL functions
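A minimal sketch of both ideas, using the Student table from earlier (the connection db and the pass mark of 50 are assumptions for illustration):

# Pull only two columns, recoding Test1 on the database side
few <- dbGetQuery(db, "SELECT studID,
                              CASE WHEN Test1 >= 50 THEN 'pass' ELSE 'fail' END AS result
                       FROM Student")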
Anthony Damico's script stores data from the Survey of Business Owners as a SQLite database. He performs various recoding and subsetting operations directly on the database with SQL queries. The final database is over 2 GB in size.
- Loaded the database with his script 'Download and Import.R'
- Ran the analysis in his script '2007 single-year - analysis examples.R'
Functions in the survey package can be "database-backed", meaning that you do not need to load the data into your workspace to execute the function. Only the weights and the variables you actually use are loaded into memory, drastically reducing memory utilization.
library(survey)
library(RSQLite)
sbo.dbname <- "sbo07.db"
db <- dbConnect(SQLite(), sbo.dbname)
sbo.svy <- svydesign(id = ~1, weight = ~tabwgt, data = "x", dbname = sbo.dbname,
dbtype = "SQLite")
dbDisconnect(db)
What is in the sbo.svy object? It contains the design metadata: weights, PSU and strata identifiers, and various statistics. The entire data set is 1.4 GB before processing and 2.25 GB after. The savings are largest when the number of variables is large: the size of the object is proportional to the number of records but independent of the number of variables.
DB-backed Independent Sampling design (with replacement)
svydesign(id = ~1, weight = ~tabwgt, data = "x",
dbname = sbo.dbname, dbtype = "SQLite")
$var
Multiple (10) imputations: svydesign(id = ~1, weight = ~newwgt,
data = imputationList(datasets = as.list(paste0("x",
1:10))), dbname = sbo.dbname, dbtype = "SQLite")
> ll()
                                                 Class     KB
db                                    SQLiteConnection      1
i                                              integer      0
MIcombine.sboimputationResultList             function     29
receipts.by.husbwife                        data.frame      1
sbo.coef                                  DBIsvydesign 304556
sbo.dbname                                   character      0
sbo.pre.y2k                       sbosvyimputationList 558633
sbo.svy                           sbosvyimputationList 543541
sbo.var                              svyimputationList 238985
subset.sbosvyimputationList                   function     14
with.sbosvyimputationList                     function     20
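With a database-backed design, analysis functions fetch only the variables named in the call from SQLite. A minimal sketch using the design object built above (the variable name receipts is hypothetical):

# Only the columns referenced in the formula are read from the database
svymean(~receipts, sbo.svy)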
Be sure to read the excellent SQLite tutorial by Sandy Muspratt: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html.
A good general SQL reference is available online for free if you are authenticated through PSU: Allen, G. & Owens, M. (2010). The Definitive Guide to SQLite (2nd ed.)
The survey package and database-backed survey objects, by Thomas Lumley: http://faculty.washington.edu/tlumley/survey/svy-dbi.html
For more on SQLite and the RQDA package, see this video: http://www.youtube.com/watch?v=R2h6EoCpKH8
This presentation was made with the Slidify package and Markdown in RStudio.