lec36

Today

Connections
Databases with dplyr

1. Connections

The primary functions to read files in R are scan() and readLines().

readLines() is the workhorse function to read raw text in R as character strings.

scan() is a low-level function for reading data values, and it is extended by read.table() and its related functions.

A connection is an object that tells R to be prepared for opening a data source (eg file in local directory, or a file url). You can think of a connection to a pointer to the file in memory.

Connections can be used anywhere a file name could be passed to functions like scan() or read.table() or writeLines().

Example:

con <- file("example")
open(con, "w")
writeLines("A line", con)
writeLines("Another line", con)
close(con)
readLines("example")

## [1] "A line"       "Another line"

unlink("example")

Connections provide a means to have more control over the way R will “comunicate” with the resources to be read (or written) .When you write or read the file, the connection (i.e. pointer) keeps place of where you are in the file.

The above example requires a connection because we are appending to the file ex.data.

One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the readLines and writeLines functions. This allows for asynchronous data processing, for example when dealing with large data or network connections:

# Read the first 30 lines, 10 lines at a time
con <- file("http://jeroenooms.github.io/data/diamonds.json") 
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
data1 %>% head(2)

## [1] "{\"carat\":0.23,\"cut\":\"Ideal\",\"color\":\"E\",\"clarity\":\"SI2\",\"depth\":61.5,\"table\":55,\"price\":326,\"x\":3.95,\"y\":3.98,\"z\":2.43}"  
## [2] "{\"carat\":0.21,\"cut\":\"Premium\",\"color\":\"E\",\"clarity\":\"SI1\",\"depth\":59.8,\"table\":61,\"price\":326,\"x\":3.89,\"y\":3.84,\"z\":2.31}"

But much of the time you don’t need a connection and it is working behind the scenes when you read and write files.

We use a connection when reading and writing to databases with sqlite and other database programs.

2. Databases with `dplyr`

The purpose of databases is to hold large data sets on a server. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle.

For example, a library stores details of all their books in a database. When you want to know if a book is in stock you can enter either the title, author or ISBN number and search for information about the book. You can find out how many copies are stored not only by your local library but also libraries in neighbouring towns. You can check when the book is due to be returned and also reserve it.

The database also records details of all the borrowers, what books they currently have out on loan and when they are due back. When they return their books the librarian will be informed if they are overdue and whether there are any fines outstanding.

MySQL and sqlite are two examples of database programs that create and manage databases. SQLite is file-based. MySQL is server based. SQL is a query language used to interact with those and other database programs. MySQL and sqlite use SQL syntax.

Since R almost exclusively works with in-memory data, if you do have a lot of data in a database, you can’t just dump it into R. Instead, you’ll have to work with subsets of your data.

To experiement with databases, it’s easiest to get started with SQLite because everything you need is included in the R package. You don’t need to install anything else, and you don’t need to deal with the hassle of setting up a database server.

The dplyr package is the easiest way to work with databases in R. You guys already have experience with a lot of SQL like commands in dplyr like select, mutate, group_by, summarize, arrange etc. Currently dplyr supports the three most popular open source databases (sqlite, mysql and postgresql) and google’s bigquery.

Examples processing databases already in an R package.

We use the function nycflights13_sqlite() in dplyr to form a connection with the database and the function tbl() in the dplyr package to read a table in the database.

nycflights13 example:

#install.packages("nycflights13")
library(nycflights13)

tbl(nycflights13_sqlite(),"flights" )

## Caching nycflights db at /var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/nycflights13.sqlite

## Creating table: airlines

## Creating table: airports

## Creating table: flights

## Creating table: planes

## Creating table: weather

## Source: sqlite 3.8.6 [/var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/nycflights13.sqlite]
## From: flights [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2013     1     1      517         2      830        11      UA  N14228
## 2   2013     1     1      533         4      850        20      UA  N24211
## 3   2013     1     1      542         2      923        33      AA  N619AA
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB
## 5   2013     1     1      554        -6      812       -25      DL  N668DN
## 6   2013     1     1      554        -4      740        12      UA  N39463
## 7   2013     1     1      555        -5      913        19      B6  N516JB
## 8   2013     1     1      557        -3      709       -14      EV  N829AS
## 9   2013     1     1      557        -3      838        -8      B6  N593JB
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

tbl(nycflights13_sqlite(), "planes")

## Source: sqlite 3.8.6 [/var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/nycflights13.sqlite]
## From: planes [3,322 x 9]
## 
##    tailnum  year                    type     manufacturer     model
##      (chr) (int)                   (chr)            (chr)     (chr)
## 1   N10156  2004 Fixed wing multi engine          EMBRAER EMB-145XR
## 2   N102UW  1998 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 3   N103US  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 4   N104UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 5   N10575  2002 Fixed wing multi engine          EMBRAER EMB-145LR
## 6   N105UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 7   N107US  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 8   N108UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 9   N109UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 10  N110UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## ..     ...   ...                     ...              ...       ...
## Variables not shown: engines (int), seats (int), speed (int), engine (chr)

```

Baseball example:

You can load and install a baseball database in R

#install.packages("Lahman")
library(Lahman)

## 
## Attaching package: 'Lahman'

## The following object is masked from 'package:car':
## 
##     Salaries

con_batting <- tbl(lahman_sqlite(), "Batting")

## Creating table: AllstarFull

## Creating table: Appearances

## Creating table: AwardsManagers

## Creating table: AwardsPlayers

## Creating table: AwardsShareManagers

## Creating table: AwardsSharePlayers

## Creating table: Batting

## Creating table: BattingPost

## Creating table: CollegePlaying

## Creating table: Fielding

## Creating table: FieldingOF

## Creating table: FieldingPost

## Creating table: HallOfFame

## Creating table: LahmanData

## Creating table: Managers

## Creating table: ManagersHalf

## Creating table: Master

## Creating table: Pitching

## Creating table: PitchingPost

## Creating table: Salaries

## Creating table: Schools

## Creating table: SeriesPost

## Creating table: Teams

## Creating table: TeamsFranchises

## Creating table: TeamsHalf

con_batting

## Source: sqlite 3.8.6 [/var/folders/pv/cy37wzfx16v_18m_yfkbzyx00000gn/T//RtmpaVOVB8/lahman.sqlite]
## From: Batting [99,846 x 22]
## 
##     playerID yearID stint teamID  lgID     G    AB     R     H   X2B   X3B
##        (chr)  (int) (int)  (chr) (chr) (int) (int) (int) (int) (int) (int)
## 1  abercda01   1871     1    TRO    NA     1     4     0     0     0     0
## 2   addybo01   1871     1    RC1    NA    25   118    30    32     6     0
## 3  allisar01   1871     1    CL1    NA    29   137    28    40     4     5
## 4  allisdo01   1871     1    WS3    NA    27   133    28    44    10     2
## 5  ansonca01   1871     1    RC1    NA    25   120    29    39    11     3
## 6  armstbo01   1871     1    FW1    NA    12    49     9    11     2     1
## 7  barkeal01   1871     1    RC1    NA     1     4     0     1     0     0
## 8  barnero01   1871     1    BS1    NA    31   157    66    63    10     9
## 9  barrebi01   1871     1    FW1    NA     1     5     1     1     1     0
## 10 barrofr01   1871     1    BS1    NA    18    86    13    13     2     1
## ..       ...    ...   ...    ...   ...   ...   ...   ...   ...   ...   ...
## Variables not shown: HR (int), RBI (int), SB (int), CS (int), BB (int), SO
##   (int), IBB (int), HBP (int), SH (int), SF (int), GIDP (int)

Batting %>%
  group_by(playerID) %>%
  summarise(total = sum(G)) %>%
  arrange(desc(total)) %>%
  head(5)

## Source: local data frame [5 x 2]
## 
##    playerID total
##       (chr) (int)
## 1  rosepe01  3562
## 2 yastrca01  3308
## 3 aaronha01  3298
## 4 henderi01  3081
## 5  cobbty01  3035

putting a database on your computer

It is possible you already have a database on your computer.

Lets load on our computer the nycflights13 database from the nycflights13 package.

We use the funciton src_sqlite() in dplyr to form a connection to an existing sqlite database or create a new one.

con <- src_sqlite(path="~/Desktop/my_db.sqlite3", create = TRUE)

Here src_sqlite forms a connection, con, of the file on your computer. Since the file my_db.sqlite3 doesn’t exist it will create it. my_db.sqlite3 is called a remote src. Think of a src as a collection of tables (i.e. a database)

You can use src_sqlite(), src_mysql(), src_postgres() and src_bigquery() to connect to the different databases supported by dplyr.

my_db.sqlite3 currently has no data in it, so we’ll load it up with the flights data using the copy_to() function.

library(nycflights13)
copy_to(con, airlines,index=list("carrier","name"))

## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: airlines [16 x 2]
## 
##    carrier                        name
##      (chr)                       (chr)
## 1       9E           Endeavor Air Inc.
## 2       AA      American Airlines Inc.
## 3       AS        Alaska Airlines Inc.
## 4       B6             JetBlue Airways
## 5       DL        Delta Air Lines Inc.
## 6       EV    ExpressJet Airlines Inc.
## 7       F9      Frontier Airlines Inc.
## 8       FL AirTran Airways Corporation
## 9       HA      Hawaiian Airlines Inc.
## 10      MQ                   Envoy Air
## 11      OO       SkyWest Airlines Inc.
## 12      UA       United Air Lines Inc.
## 13      US             US Airways Inc.
## 14      VX              Virgin America
## 15      WN      Southwest Airlines Co.
## 16      YV          Mesa Airlines Inc.

copy_to copies a table (in this case flights) via the connection con to the remote src my_db.sqlite3 on my computer).

We use the function tbl() to read a table in the database (my_db.sqlite3 or nycflights_sqlite()). Value of tbl is a table object. A table object only print a few rows and all the columns that fit on one screen, describing the rest of it as text.

airlines_sqlite <- tbl(con, "airlines") 
airlines_sqlite

## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: airlines [16 x 2]
## 
##    carrier                        name
##      (chr)                       (chr)
## 1       9E           Endeavor Air Inc.
## 2       AA      American Airlines Inc.
## 3       AS        Alaska Airlines Inc.
## 4       B6             JetBlue Airways
## 5       DL        Delta Air Lines Inc.
## 6       EV    ExpressJet Airlines Inc.
## 7       F9      Frontier Airlines Inc.
## 8       FL AirTran Airways Corporation
## 9       HA      Hawaiian Airlines Inc.
## 10      MQ                   Envoy Air
## 11      OO       SkyWest Airlines Inc.
## 12      UA       United Air Lines Inc.
## 13      US             US Airways Inc.
## 14      VX              Virgin America
## 15      WN      Southwest Airlines Co.
## 16      YV          Mesa Airlines Inc.

Mysql

Mysql allows you to read data in from a remote database.

Basic syntax:

# To connect to a database first create a src:
con <- src_mysql(host = "blah.com", user = "adam", password = "pass")
# Then reference a tbl within that src
my_tbl <- tbl(con, "my_table")

Although nycflights13 is a reference to data that lives outside of R, you can use the dplyr commands on them as usual. Behind the scenes, dplyr will convert the commands to the database’s native language (in this case, SQL), and return the results. This allows you to pull data that is too large to fit in R: only the fraction of the data that you need will actually be downloaded into R, which will usually fit into R without memory issues.

con <- src_mysql(dbname = "dplyr", 
                  host = "dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                  port = 3306, 
                  user = "dplyr",
                  password = "dplyr")

nycflights <- tbl(con, "dplyr")
nycflights

## Source: mysql 5.6.21-log [dplyr@dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## From: dplyr [336,776 x 17]
## 
##       id  year month   day dep_time dep_delay arr_time arr_delay carrier
##    (int) (int) (int) (int)    (int)     (int)    (int)     (int)   (chr)
## 1      1  2013     1     1      517         2      830        11      UA
## 2      2  2013     1     1      533         4      850        20      UA
## 3      3  2013     1     1      542         2      923        33      AA
## 4      4  2013     1     1      544        -1     1004       -18      B6
## 5      5  2013     1     1      554        -6      812       -25      DL
## 6      6  2013     1     1      554        -4      740        12      UA
## 7      7  2013     1     1      555        -5      913        19      B6
## 8      8  2013     1     1      557        -3      709       -14      EV
## 9      9  2013     1     1      557        -3      838        -8      B6
## 10    10  2013     1     1      558        -2      753         8      AA
## ..   ...   ...   ...   ...      ...       ...      ...       ...     ...
## Variables not shown: tailnum (chr), flight (int), origin (chr), dest
##   (chr), air_time (int), distance (int), hour (int), minute (int)

# We do some data wrangling: Group nycflights data by carrier, then summarise() with two variables: n_flights, the number of flights flown by each carrier and avg_delay, the average arrival delay of flights flown by each carrier. Finally, arrange the carriers by average delay from low to high.

  nycflights %>% group_by(carrier) %>% summarise(n_flights=n(), avg_delay=mean(arr_delay)) %>% arrange(avg_delay)

## Source: mysql 5.6.21-log [dplyr@dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## From: <derived table> [?? x 3]
## Arrange: avg_delay

## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric

##    carrier n_flights avg_delay
##      (chr)     (dbl)     (dbl)
## 1       AS       714   -9.8613
## 2       HA       342   -6.9152
## 3       AA     32729    0.3556
## 4       DL     48110    1.6289
## 5       VX      5162    1.7487
## 6       US     20536    2.0565
## 7       UA     58665    3.5045
## 8       9E     18460    6.9135
## 9       B6     54635    9.3565
## 10      WN     12275    9.4675
## ..     ...       ...       ...

Basic Verbs

con <- src_sqlite(path="~/Desktop/my_db.sqlite3", create = TRUE) 
copy_to(con, flights)

## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2013     1     1      517         2      830        11      UA  N14228
## 2   2013     1     1      533         4      850        20      UA  N24211
## 3   2013     1     1      542         2      923        33      AA  N619AA
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB
## 5   2013     1     1      554        -6      812       -25      DL  N668DN
## 6   2013     1     1      554        -4      740        12      UA  N39463
## 7   2013     1     1      555        -5      913        19      B6  N516JB
## 8   2013     1     1      557        -3      709       -14      EV  N829AS
## 9   2013     1     1      557        -3      838        -8      B6  N593JB
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

flights_sqlite <- tbl(con, "flights") 
flights_sqlite

## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2013     1     1      517         2      830        11      UA  N14228
## 2   2013     1     1      533         4      850        20      UA  N24211
## 3   2013     1     1      542         2      923        33      AA  N619AA
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB
## 5   2013     1     1      554        -6      812       -25      DL  N668DN
## 6   2013     1     1      554        -4      740        12      UA  N39463
## 7   2013     1     1      555        -5      913        19      B6  N516JB
## 8   2013     1     1      557        -3      709       -14      EV  N829AS
## 9   2013     1     1      557        -3      838        -8      B6  N593JB
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)

We can manipulate the table flights_sqlite with all the usual dplyr verbs like select, filter, mutate, summarise etc.

select(flights_sqlite, year:day, dep_delay, arr_delay)

## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 5]
## 
##     year month   day dep_delay arr_delay
##    (int) (int) (int)     (dbl)     (dbl)
## 1   2013     1     1         2        11
## 2   2013     1     1         4        20
## 3   2013     1     1         2        33
## 4   2013     1     1        -1       -18
## 5   2013     1     1        -6       -25
## 6   2013     1     1        -4        12
## 7   2013     1     1        -5        19
## 8   2013     1     1        -3       -14
## 9   2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## ..   ...   ...   ...       ...       ...

mutate(flights_sqlite, speed = air_time / distance)

## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: flights [336,776 x 17]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
## 1   2013     1     1      517         2      830        11      UA  N14228
## 2   2013     1     1      533         4      850        20      UA  N24211
## 3   2013     1     1      542         2      923        33      AA  N619AA
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB
## 5   2013     1     1      554        -6      812       -25      DL  N668DN
## 6   2013     1     1      554        -4      740        12      UA  N39463
## 7   2013     1     1      555        -5      913        19      B6  N516JB
## 8   2013     1     1      557        -3      709       -14      EV  N829AS
## 9   2013     1     1      557        -3      838        -8      B6  N593JB
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl), speed (dbl)

Task for you:

Find the mean departure time, dep_time, of flights_sqlite.

flights_sqlite %>% summarise( delay = mean(dep_time))

## Source: sqlite 3.8.6 [~/Desktop/my_db.sqlite3]
## From: <derived table> [?? x 1]
## 
##      delay
##      (dbl)
## 1  1349.11
## ..     ...

#or
as.data.frame(flights_sqlite)$dep_time %>% mean(na.rm=TRUE)

## Warning: Only first 100,000 results retrieved. Use n = -1 to retrieve all.

## [1] 1344.535

The only differnce between using these data verbs on a database and using them on a data frame is that the expressions in select(), filter(), arrange(), mutate(), and summarise() are translated into SQL so they can be run on the database.

SQL translation

When doing simple mathematical operations of the form you normally use when filtering, mutating and summarising it’s relatively straightforward to translate R code to SQL

To experiment with the translation, use translate_sql(). The following examples work through some basic differences between R and SQL.

# In SQLite variable names are escaped by double quotes:
translate_sql(x)

## <SQL> "x"

#> <SQL> "x"
# And strings are escaped by single quotes
translate_sql("x")

## <SQL> 'x'

#> <SQL> 'x'

# Many functions have slightly different names
translate_sql(x == 1 && (y < 2 || z > 3))

## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)

#> <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)

## <SQL> POWER("x", 2.0) < 10.0

#> <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)

## <SQL> "x" % 2.0 = 10.0

#> <SQL> "x" % 2.0 = 10.0

dplyr knows how to convert the following R functions to SQL:

basic math operators: +, -, *, /, %%, ^
math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
logical comparisons: <, <=, !=, >=, >, ==, %in%
boolean operations: &, &&, |, ||, !, xor
basic aggregations: mean, sum, min, max, sd, var