- You already know SQL
- A normalized database structure is appropriate for the data
- You want to share data with another application, such as one written in PHP
- You want to reduce memory requirements
- You are using an R package or script that is database-aware
For these reasons, SQLite is considered the most widely deployed database engine. Cellphone and other mobile device applications commonly use SQLite.
- DBI: basic functions to connect to all types of external databases
- RSQLite: functions to embed SQLite in R
- RSQLite.extfuns: includes some SQLite extensions
- sqldf: like using SQL queries? Use SQL syntax with R dataframes.

library(DBI)
library(RSQLite)
# This connection creates an empty database if it does not exist
db <- dbConnect(SQLite(), dbname = "./Test.sqlite")
dbSendQuery(conn = db, "CREATE TABLE School(SchID INTEGER, Location TEXT,
            Authority TEXT, SchSize TEXT)")
## <SQLiteResult: DBI RES (843, 0, 1)>
dbSendQuery(conn = db, "INSERT INTO School values (1,'urban', 'state', 'medium')")
## <SQLiteResult: DBI RES (843, 0, 2)>
dbSendQuery(conn = db, "INSERT INTO School values (2,'urban', 'independent', 'large')")
## <SQLiteResult: DBI RES (843, 0, 3)>
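dbSendQuery returns a result object; with current DBI it is good practice to fetch from it and then clear it when you are done. A minimal sketch, not part of the original code:

# Fetch a quick row count, then release the result set
res <- dbSendQuery(conn = db, "SELECT count(*) AS n FROM School")
dbFetch(res)
dbClearResult(res)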
If your data already resides in an R dataframe...
# Load your R dataframe
load(file = "~/RProject/sqlite/DATA/Class.rda")
# Push to SQLite
dbWriteTable(conn = db, name = "Class", value = Class, row.names = FALSE)
## [1] TRUE
dbWriteTable(conn = db, name = "Student", value = "~/RProject/sqlite/DATA/student.csv",
row.names = FALSE, header = TRUE)
## [1] TRUE
dbListFields(db, "Student")
## [1] "studID" "classID" "Gender" "Test1" "Test2"
Check out the functions in the XLConnect package, which could read this data into SQLite from a three-worksheet Excel file.
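A minimal sketch of that idea, assuming a workbook schools.xlsx whose three worksheets are named after the tables (the file and sheet names are hypothetical):

library(XLConnect)
wb <- loadWorkbook("schools.xlsx")
# Write each worksheet to a SQLite table of the same name
for (sheet in getSheets(wb)) {
    dbWriteTable(conn = db, name = sheet, value = readWorksheet(wb, sheet),
                 row.names = FALSE)
}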
# For tables
dbListTables(db)
## [1] "Class" "School" "Student"
# For fields in a table
dbListFields(db, "School")
## [1] "SchID" "Location" "Authority" "SchSize"
# With dbReadTable
dbReadTable(db, "School")
## SchID Location Authority SchSize
## 1 1 urban state medium
## 2 2 urban independent large
# Or query
dbGetQuery(db, "SELECT * from School")
## SchID Location Authority SchSize
## 1 1 urban state medium
## 2 2 urban independent large
# Close the connection to the database
dbDisconnect(db)
## [1] TRUE
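As an aside, the sqldf package mentioned earlier skips the explicit connection entirely and lets you query an ordinary R dataframe with SQL syntax; it manages a temporary SQLite database behind the scenes. A quick illustration with the built-in iris data:

library(sqldf)
# SQL against a dataframe in your workspace
sqldf("SELECT Species, count(*) AS n FROM iris GROUP BY Species")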
An add-on allows you to view SQLite files, similar to phpMyAdmin.
If the file is huge and you want to manipulate it outside of R, you can install SQLite itself.
The RQDA package leverages the capabilities of SQLite to create a database containing the classification codes and metadata required for coding documents in qualitative analysis.
db <- dbConnect(SQLite(), dbname = "./DATA/q64.rqda")
dbListTables(db)
myCodes <- dbGetQuery(conn = db,
    "SELECT source.name AS filename, source.id AS fid, C.cid AS cid, C.code AS code
     FROM source
     INNER JOIN (SELECT freecode.name AS code, coding.cid, coding.fid
                 FROM coding INNER JOIN freecode
                 WHERE freecode.id = coding.cid AND coding.status = 1) C
     WHERE source.id = C.fid AND source.status = 1")
# Reshape into file id (or subject) level. Could also merge in subject
# attributes and related quantitative information from other data sources.
myPersonLevelFile <- reshape(myCodes, v.names = "code", timevar = "cid", idvar = "fid",
direction = "wide")
# Count the number of codes per person, added as a column to the dataframe
myPersonLevelFile$nCodes <- rowSums(!is.na(myPersonLevelFile[3:5]))
Using SQLite can reduce memory requirements when you have a very large file but typically need only a small number of variables at a time.
- Use queries to pull in only the variables you need (a sketch follows this list)
- Perform recodes on the disk file where possible, using SQL functions
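A minimal sketch of both ideas, using the Student table from earlier (the connection db and the pass mark of 50 are assumptions for illustration):

# Pull only two columns, recoding Test1 on the database side
few <- dbGetQuery(db, "SELECT studID,
                              CASE WHEN Test1 >= 50 THEN 'pass' ELSE 'fail' END AS result
                       FROM Student")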
Anthony Damico's script stores data from the Survey of Business Owners as a SQLite database. He performs various recoding and subsetting operations directly on the database with SQL queries. The final database is over 2 GB in size.
- Loaded the database with his script 'Download and Import.R'
- Ran the analysis in his script '2007 single-year - analysis examples.R'
Functions in the survey package can be "database-backed", meaning that you do not need to load the data into your workspace to execute the function. Only the weights and the variables you actually use are loaded into memory, drastically reducing memory utilization.
library(survey)
library(RSQLite)
sbo.dbname <- "sbo07.db"
db <- dbConnect(SQLite(), sbo.dbname)
sbo.svy <- svydesign(id = ~1, weight = ~tabwgt, data = "x", dbname = sbo.dbname,
dbtype = "SQLite")
dbDisconnect(db)
What is in the sbo.svy object? It contains the design metadata: weights, PSU and strata identifiers, and various statistics. The entire data set is 1.4 GB before processing and 2.25 GB after. The savings are largest when the number of variables is large: the size of the object is proportional to the number of records but independent of the number of variables.
DB-backed Independent Sampling design (with replacement)
svydesign(id = ~1, weight = ~tabwgt, data = "x",
dbname = sbo.dbname, dbtype = "SQLite")
$var
Multiple (10) imputations: svydesign(id = ~1, weight = ~newwgt,
data = imputationList(datasets = as.list(paste0("x",
1:10))), dbname = sbo.dbname, dbtype = "SQLite")
> ll()
                                                 Class     KB
db                                    SQLiteConnection      1
i                                              integer      0
MIcombine.sboimputationResultList             function     29
receipts.by.husbwife                        data.frame      1
sbo.coef                                  DBIsvydesign 304556
sbo.dbname                                   character      0
sbo.pre.y2k                       sbosvyimputationList 558633
sbo.svy                           sbosvyimputationList 543541
sbo.var                              svyimputationList 238985
subset.sbosvyimputationList                   function     14
with.sbosvyimputationList                     function     20
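With a database-backed design, analysis functions fetch only the variables named in the call from SQLite. A minimal sketch using the design object built above (the variable name receipts is hypothetical):

# Only the columns referenced in the formula are read from the database
svymean(~receipts, sbo.svy)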
Be sure to read the excellent SQLite tutorial by Sandy Muspratt: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html.
A good general SQL reference is available online for free if you are authenticated through PSU: Allen, G. & Owens, M. (2010). The Definitive Guide to SQLite (2nd ed.)
The survey package and database-backed survey objects, by Thomas Lumley: http://faculty.washington.edu/tlumley/survey/svy-dbi.html
For more on SQLite and the RQDA package, see this video: http://www.youtube.com/watch?v=R2h6EoCpKH8
This presentation was made with the Slidify package and Markdown in RStudio.